So far, we have been importing data. It is also common practice to export or save out data sets while processing them. Data sets are either saved out as final cleaned versions of data or in intermediate steps.
Both of these outputs can be used for analysis or as input to another part of the data processing pipeline.
Python has a way to pickle data. This is Python’s way of serializing and saving data in a binary format. Reading pickle data is also backwards compatible, meaning that newer versions of Python can read pickle files written by older ones.
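To make the idea of serializing concrete, here is a minimal sketch that uses only the standard-library pickle module, without pandas. The record dictionary and the temporary file name are made up for illustration and are not part of the scientists data set.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical sample data, not taken from the scientists data set
record = {"name": "Ada", "age": 36, "tags": ["mathematician"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "record.pickle"

    # Serialize: pickle files must be opened in binary mode ("wb")
    with open(path, "wb") as f:
        pickle.dump(record, f)

    # Deserialize: read the bytes back into a new Python object
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object is an equal copy, not the same object
print(restored == record)
print(restored is record)
```

The pandas to_pickle and read_pickle functions used in the rest of this post wrap the same idea for Series and DataFrame objects.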
In Series
Many of the export methods for a Series are also available for a DataFrame. Those who have experience with numpy will know that a save method is available for ndarrays. pandas objects once had a similar save method, but it has been deprecated, and the replacement is to use the to_pickle method.
names = scientists['Name']
print(names)
# pass in a string to the path you want to save
names.to_pickle('../output/scientists_names_series.pickle')
The pickle output is in a binary format. Thus, if you try to open it in a text editor, you will see a bunch of garbled characters. Pickle is a good fit when the object you are saving is an intermediate step in a set of calculations, or when you know that your data will stay in the Python world: the object is restored exactly as it was saved, with its index and data types intact, and the binary format is usually compact on disk. However, this approach means that people who do not use Python will not be able to read the data.
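The short sketch below illustrates why a text editor shows garbled characters. It uses the standard-library pickle module rather than a pandas file, and the list of names is a made-up stand-in. The pickled result is a bytes object, and trying to interpret it as UTF-8 text fails.

```python
import pickle

# Hypothetical list of names, used only for illustration
payload = pickle.dumps(["Ada", "Grace"])

# The serialized form is raw bytes, not text
print(type(payload))
print(payload[:2])

# Trying to read the bytes as UTF-8 text raises an error
try:
    payload.decode("utf-8")
    decoded_as_text = True
except UnicodeDecodeError:
    decoded_as_text = False

print(decoded_as_text)
```

This is also why pickle files should always be opened in binary mode when they are read or written by hand.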
In DataFrame
The same method can be used on DataFrame objects.
scientists.to_pickle('../output/scientists_df.pickle')
To read in pickle data, we can use the pd.read_pickle function.
# for a Series
scientist_names_from_pickle = pd.read_pickle(
    '../output/scientists_names_series.pickle')
print(scientist_names_from_pickle)
# for a DataFrame
scientists_from_pickle = pd.read_pickle(
    '../output/scientists_df.pickle')
print(scientists_from_pickle)
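The whole round trip can be checked in one self-contained script. This is only a sketch: scientists_demo is a small made-up stand-in for the scientists data set, and the files are written to a temporary directory instead of ../output so that the script runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the scientists data set used in this post
scientists_demo = pd.DataFrame(
    {"Name": ["Ada", "Grace"], "Age": [36, 85]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    df_path = Path(tmp_dir) / "scientists_df.pickle"
    series_path = Path(tmp_dir) / "scientists_names_series.pickle"

    # Save the DataFrame and one of its columns (a Series)
    scientists_demo.to_pickle(df_path)
    scientists_demo["Name"].to_pickle(series_path)

    # Load both objects back from disk
    df_back = pd.read_pickle(df_path)
    names_back = pd.read_pickle(series_path)

# equals() compares values, index, and dtypes
print(df_back.equals(scientists_demo))
print(names_back.equals(scientists_demo["Name"]))
print(type(names_back))
```

Note that pd.read_pickle returns whatever type was saved, so the same function gives back a Series for one file and a DataFrame for the other.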
By convention, pickle files are saved with an extension of .p, .pkl, or .pickle.
So this is how we use pickle for exporting and importing data. In the next post we will see how to use CSV for data storage.