Friday, 24 July 2015


Python and Hadoop project puts data scientists first


Scientists and mathematicians have long loved Python as a vehicle for working with data and automation. Python has not lacked for libraries such as Hadoopy or Pydoop to work with Hadoop, but those libraries are designed more with Hadoop users in mind than data scientists proper.
Cloudera's new project, Ibis, is an open source (Apache licensed) data analysis framework meant to span the gap. It provides "comprehensive support for the built-in analytic capabilities in Impala for simplified ETL, data wrangling, and data analysis," as Cloudera puts it.
To that end, Ibis seems as much about providing data-science Pythonistas with an automated avenue into Cloudera's Impala framework (a SQL-querying system for Hadoop) as it is about working with Hadoop. (Cloudera engineers Wes McKinney and Marcel Kornacker describe Ibis as "providing a high level Python front-end for Hadoop rather than providing low-level access to a computation model like MapReduce or Spark.")
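To make the "high level front-end" idea concrete, here is a minimal sketch of the general pattern: Python method calls build up a query description, which is only turned into SQL when asked for. This is not the Ibis API. The class `TableExpr`, its methods, and the table name `web_logs` are all hypothetical, invented purely to illustrate the approach.

```python
# Hypothetical illustration only: none of these names come from Ibis or Impala.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TableExpr:
    """An immutable description of a query against one table."""
    name: str
    predicates: Tuple[str, ...] = ()
    row_limit: Optional[int] = None

    def filter(self, predicate: str) -> "TableExpr":
        # Return a new expression with one more WHERE condition.
        return TableExpr(self.name, self.predicates + (predicate,), self.row_limit)

    def limit(self, n: int) -> "TableExpr":
        # Return a new expression capped at n rows.
        return TableExpr(self.name, self.predicates, n)

    def compile(self) -> str:
        # Render the accumulated description as a SQL string.
        sql = f"SELECT * FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        if self.row_limit is not None:
            sql += f" LIMIT {self.row_limit}"
        return sql


# Nothing is executed here; the expression is just data until compiled.
expr = TableExpr("web_logs").filter("status = 500").limit(10)
query = expr.compile()
print(query)
```

In a real front-end the compiled SQL would be handed to an engine such as Impala for execution; the sketch stops at producing the string, since the point is only that the user composes Python calls rather than writing SQL or MapReduce jobs by hand.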
That said, it's not hard to see how using Python to work with Impala would allow for new kinds of data-exploration automation. Cloudera CEO Mike Olson described Impala's utility as a two-way street: "You can run queries [with Impala] that create results that you then MapReduce. You can use MapReduce to analyze data that you then query [with Impala]."
For now, Ibis is offered only as a preview, and Cloudera has hinted at the project's eventual evolution. "Upcoming versions," stated the project's press release, "will allow users to leverage the full range of Python packages as well as author their own Python functions."
Hadoop has been a Java-centric enterprise since the beginning, meaning anyone with a Python-centric workflow has been forced to deal with the framework at arm's length or greater. The key question is whether Ibis can in time provide a general soup-to-nuts framework for using Python with Hadoop, both inside and outside of Impala.