When it comes to data science these days, the two reigning open source languages are R and Python. Commercial data science products (like SAS, MATLAB, Mathematica, etc.) have apparently ceded turf to open source alternatives over the last few years, most likely as a result of ... well, all the benefits that open source software entails.
R has its roots in a language devoted to statistics and mathematics, while Python was created as a general-purpose language, and then people wrote modules that added the data science capabilities. I’m not a data scientist, but having used both R and Python, I much prefer Python because to me it brings the best of both worlds: general-purpose, familiar programming constructs, Python web frameworks, and the data science and visualization capabilities. If I’m going to be building data science tools for the web, I can do it all in Python: data manipulation, web server, ETL, etc.
One evolving and compelling space in data science tooling is the area of web-based notebooks. The notebook motif has been around in commercial tools for a while, as in Mathematica notebooks. (A notebook presents a linear sequence of executable code and the results of intermediate evaluations of selected lines, along with visualizations.) Initially, notebooks were built only in graphical toolkits that ran in a special thick client. But a new framework called IPython Notebook puts the notebook on a web server. So instead of the Python executable running on your laptop or workstation, it runs on the server and delivers an interactive shell via the browser. Among other things, this lets users easily share their notebooks with others. Notebooks can also be exported in a variety of formats for read-only perusal.
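To make the notebook idea concrete, here is a minimal sketch of the execution model described above: cells run in order against one shared namespace, and the value of a cell's final expression is kept as that cell's result. This is only an illustration, not how IPython or Jupyter is actually implemented; the function name `run_cells` and the sample cells are hypothetical.

```python
import ast


def run_cells(cells):
    """Run each cell's source in order, sharing one namespace across cells.

    Returns one result per cell: the value of the cell's final expression,
    or None if the cell does not end in an expression.
    (Hypothetical helper for illustration; real notebook kernels do far more.)
    """
    namespace = {}
    results = []
    for source in cells:
        tree = ast.parse(source)
        value = None
        if tree.body and isinstance(tree.body[-1], ast.Expr):
            # Split off the trailing expression so its value can be captured.
            last_expr = ast.Expression(tree.body.pop().value)
            exec(compile(tree, "<cell>", "exec"), namespace)
            value = eval(compile(last_expr, "<cell>", "eval"), namespace)
        else:
            exec(compile(tree, "<cell>", "exec"), namespace)
        results.append(value)
    return results


# Three illustrative cells; later cells see names defined by earlier ones.
cells = [
    "x = 2",
    "y = x * 21\ny",
    "import math\nmath.sqrt(16)",
]
results = run_cells(cells)
```

A real notebook server adds the parts that matter in practice (a browser front end, rich output such as plots, and a separate kernel process), but the shared-namespace, cell-by-cell flow is the core of the experience.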
Consider the scenario where you’re on a small team of data scientists and you’re developing some approaches to a problem. It’s early in the exploration phase and you’re going to be seeking input from your coworkers on your approach. In this case you could start your work in a notebook. This allows you to:
- Present your findings at a coworker’s desk, or anywhere at all if you expose the notebook server to the outside world (Jupyter supports password protection)
- Allow someone to clone the notebook (this is done via a JSON file format) and pursue their own line of inquiry with your code
- Export to HTML or PDF and email the notebook to someone you don’t wish to give access to the server
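Because a notebook is stored as JSON, cloning one amounts to copying a data structure. The sketch below assumes the version 4 notebook layout (a top-level `cells` list plus `metadata`, `nbformat`, and `nbformat_minor`) and uses only the standard library. The notebook content and the helper `clone_without_outputs` are made up for illustration.

```python
import copy
import json

# A hand-written, minimal notebook in the nbformat 4 layout (illustrative content).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Exploration"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["42\n"]}
            ],
            "source": ["print(6 * 7)"],
        },
    ],
}


def clone_without_outputs(nb):
    """Return a deep copy of a notebook dict with code-cell outputs cleared.

    Hypothetical helper: a coworker starting their own line of inquiry
    usually wants the code but not the stale results.
    """
    clone = copy.deepcopy(nb)
    for cell in clone["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return clone


clone = clone_without_outputs(notebook)

# Serializing the clone gives text that could be saved as a new .ipynb file.
serialized = json.dumps(clone, indent=1)
```

In practice you would more likely copy the `.ipynb` file itself or use the notebook interface to duplicate it. The point is that the file holds nothing opaque, just cells and their outputs as plain JSON.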
Sample of a Notebook:
Here are some other examples of notebooks.
The notebook capability was initially part of the IPython codebase, but once people saw its general utility, it was extracted and refactored so that other languages (like Ruby) could take advantage of the work done to create notebooks, which are in principle language-agnostic. The new platform is called Jupyter.
Related commercial alternatives
The Wakari platform from Continuum Analytics is a commercial implementation of notebooks with extra capabilities added in. Continuum is also known for the Anaconda tools, which are widely used for Python package management. Wakari supports both cloud and on-premises installation. It appears that this company is well-positioned to offer a good experience with any Python-related tooling.
Sense.io appears to be tackling a larger space by supporting four languages and adding features for modeling, job scheduling, etc.
You be the judge of which platform might be suitable for your team if you choose to go beyond the base open-source Jupyter offering.
- SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering
- NumPy is the fundamental package for scientific computing with Python
- Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools
- Getting Started with Python for Data Science (Kaggle)
- Python for Data Analysis (book)