We recently acquired a small cluster of machines to run Hadoop on. We’re running the Cloudera distribution with an installed instance of Apache Spark. Since I’ve taken quite a liking to Python, I’m heavily using PySpark. I’ve worked quite a bit with Pandas, but Pandas does not easily support multiprocessing (I’ve tried some easy workarounds, but they’re just that: workarounds). While I found that PySpark DataFrames can do similar things to Pandas DataFrames, I could not find an easy and robust way to read CSV files full of highly faulty user data (such as newlines in URLs, or lame “hacking” attempts such as double dashes in usernames or commas in tags), the way Pandas can:
myCsvDataFrame = pandas.read_csv(myCsvFile, header=0, ...)
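As an aside, robustness here mostly comes down to proper quoting: a correctly quoted CSV field may legally contain commas and even embedded newlines. A minimal stand-in using only Python’s standard csv module (the sample data below is made up for illustration) shows what a robust reader has to cope with:

```python
import csv
import io

# Made-up sample: quoted fields containing a double dash, a comma,
# and an embedded newline -- the kinds of "faulty" user data described above.
raw = 'user,tags,url\n"bob--admin","foo,bar","http://example.com/\na"\n'

rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

print(header)         # first line parsed as column names
print(records[0][1])  # the comma survives inside a single field
```

A reader that naively splits on commas and newlines would mangle this input; both Pandas and spark-csv handle the quoting correctly.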
After a bit of searching the Web, I found the spark-csv package for Spark. With it, I only have to create a SQLContext from my SparkContext and load the CSV file with spark-csv as the source, and I’m done:
sqlContext = SQLContext(sc)
mySparkDataFrame = sqlContext.load(myCsvFile, source="com.databricks.spark.csv", header=True)
To be able to use this package, we have to add it when launching the PySpark shell, as follows:
pyspark --packages com.databricks:spark-csv_2.11:1.0.3
It even supports reading the first line of the CSV file as a header and naming the columns accordingly, though I haven’t yet figured out how to rename the columns myself.
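For what it’s worth, a sketch of one way this might work, assuming the standard DataFrame API’s withColumnRenamed method and using the DataFrame loaded above (I haven’t verified this against our setup; "old_name" and "new_name" are hypothetical placeholders):

renamed = mySparkDataFrame.withColumnRenamed("old_name", "new_name")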