Accessing Data in Distributed Environment from Non-Distributed Systems

rukmani vaithianathan
2 min readJul 17, 2021

Often times, the distributed data processing is not supported by traditional applications, however the processed data can be consumed through different means.

Pickling is one such option in python, which helps to transfer the serialised objects over network, or to store them in a file and de-serialise for later consumption.

Serialisation is the process of converting in-memory objects — python’s dataframes, data structures (not all), classes and functions to a pickled file.

Pickle over JSON:

It is strictly dependent on the use case and the architecture. JSON is language independent, however python’s pickle provide easy re-construction of the objects.

Why to use Pickling:

To store intermediate results for later use.

To transfer objects between different systems over network.

To persist python objects.

How to Pickle:

Spark’s dataframe can be pickled on its rdd function, saveAsPickleFile()

df_csv.rdd.saveAsPickleFile("/pickling//pickled_file/”)

This will create pickled files under the HDFS directory specified.

Note: Alternately, files can be pickled and stored locally from Spark application, to be consumed by external systems with no hadoop and spark binaries. This is demonstrated in the example below.

Consume Pickled Files from System with no Hadoop and Spark Binaries:

Though pandas support read_pickle, spark’s pickle objects are not compatible to be directly converted into pandas dataframe.

One of the workarounds is

To use python library — sparkpickle

Generate pyspark.sql.row type objects using sparkpickle’s load_gen

Convert the list of row objects to pandas dataframe.

Example : Accessing Spark’s Dataframe from Spark and Hadoop Unsupported System

  1. Convert CSV from HDFS into Spark Dataframe

2. Pickle Spark Dataframe:

3. Convert Pickled Files to Pandas Dataframe from a System with no Hadoop and Spark Binaries:

Required Python Libraries:

1.Pandas

2. sparkpickle

Complete example can be found from the github repo — https://github.com/RukmaniVaithy/pickle_spark_to_pandas/blob/main/Picking%20in%20Spark%20and%20Pandas.ipynb

When Not to Pickle:

Transferring and accessing pickled files is recommended only for communication between the internal systems in an org. It is possible to send malicious objects as pickles and hence not secure to receive and deserialise objects from external networks.

Disclaimer: This disclaimer informs readers that the views, thoughts, and opinions expressed in the text belong solely to the author, and not necessarily to the author’s employer, organization, committee or other group or individual.

--

--