Reading CSV files in Apache Spark

We recently acquired a small cluster of machines to run Hadoop. We’re running the Cloudera distribution, which comes with an installed instance of Apache Spark. Since I’ve taken quite a liking to Python, I use PySpark heavily. I’ve worked quite a bit with Pandas, but Pandas does not easily support multiprocessing (I’ve tried some easy workarounds, but they’re just that: workarounds). While I found that PySpark DataFrames can do much of what Pandas DataFrames can, I could not find an easy and robust way to read CSV files containing large amounts of highly faulty user data (such as newlines

Continue reading »