Sunday, August 23, 2020

CSV data storage type

Fig. 11.1, [Comma separated value (CSV) file formatted to the RFC ...

Comma-separated values (CSV) is one of the most flexible data storage formats. Each row is a line of text, and the values for each column are separated by a comma. The comma is not the only delimiter, however: some files are delimited by a tab (TSV) or even a semicolon. CSV is a preferred format for collaborating and sharing data because it is plain text, so almost any program can open it. It can even be opened in a text editor.

Both Series and DataFrame have a to_csv method that writes a CSV file. The documentation for Series.to_csv and DataFrame.to_csv lists many ways to modify the resulting file. For example, if you want to save a TSV file because there are commas in your data, you can change the sep parameter.

# save a series into a CSV
names.to_csv('../output/scientist_names_series.csv')
# save a dataframe into a TSV
# (tab-separated values) file
scientists.to_csv('../output/scientists_df.tsv', sep='\t')
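
To see what the sep parameter changes, here is a minimal, self-contained sketch. The small scientists dataframe below is a hypothetical stand-in, not the data used in this series. It relies on to_csv returning the text as a string when no path is given, which makes the result easy to inspect.

```python
import pandas as pd

# Hypothetical stand-in for the `scientists` dataframe (made-up rows)
scientists = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset"],
    "Occupation": ["Chemist, crystallographer", "Statistician"],
})

# With no file path, to_csv returns the output as a string
csv_text = scientists.to_csv()
tsv_text = scientists.to_csv(sep="\t")

print(csv_text)
print(tsv_text)
```

In the comma-delimited output, the value that contains a comma is wrapped in double quotes so it still reads back as one field. In the tab-delimited output no quoting is needed, because the value does not contain the delimiter.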

Removing Row Numbers From Output 

If you open the CSV or TSV file you just created, you will notice that the first “column” looks like the row number of the dataframe. Often this is not needed, especially when you are collaborating with other people. Keep in mind, though, that this “column” is really the row label (the index), which may be important. The to_csv documentation shows an index parameter that controls whether the row labels are written; it defaults to True.

# do not write the row names in the CSV output
scientists.to_csv('../output/scientists_df_no_index.csv', index=False)
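
The effect of index=False is easiest to see by comparing header rows. This is a small sketch using a hypothetical two-row dataframe (not the data from this series), again returning the text as a string instead of writing a file.

```python
import pandas as pd

# Hypothetical stand-in for the `scientists` dataframe (made-up rows)
scientists = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset"],
    "Age": [37, 61],
})

# Default: the row labels are written as an unnamed first column
with_index = scientists.to_csv()

# index=False: only the actual columns are written
without_index = scientists.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

The first header row starts with an empty field for the index, while the second contains only Name and Age.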

Importing CSV Data

To import a CSV file, use the pd.read_csv function. The read_csv documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) shows the various ways you can read in a CSV.
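
As a rough illustration, the sketch below reads the two kinds of files discussed in this post. The text is made-up sample data held in memory with io.StringIO so the example runs without any files on disk; with real files you would pass the file path instead.

```python
import io

import pandas as pd

# Made-up tab-separated text, standing in for a TSV file on disk
tsv_text = "Name\tAge\nRosaline Franklin\t37\nWilliam Gosset\t61\n"

# sep tells read_csv which delimiter the file uses
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Made-up CSV text that was saved with the row labels included
csv_text = ",Name,Age\n0,Rosaline Franklin,37\n1,William Gosset,61\n"

# index_col=0 uses the first column as the row labels
# instead of treating it as data
from_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(from_tsv)
print(from_csv)
```

Both calls produce a dataframe with the columns Name and Age. Without index_col=0, the second file would gain an extra unnamed column holding the old row labels.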

In the next post we'll see how to store data in Excel, which is probably the most commonly used data format.
