Amandeep Singh
"Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language" - pandas.pydata.org |
Usually I use the Pandas package in Python to explore and process the datasets that I need for a project. When I start a new Data Analytics project, many steps in the model-building procedure are the same every time. This means that some of the code gets reused across projects, and I'll be honest - either I don't always remember these code snippets correctly, or I am too lazy to type them again. So, to speed things up, I made a cheatsheet of these frequently used snippets/commands for my projects.
Note:
- I am not a code-genius. In fact, I don't know most things when I start a project. So, like any other coder, I go to Stack Overflow or the official documentation whenever in doubt. Most of these commands have their roots in a generous Stack Overflow answer or the documentation, and I later modified them to suit the needs of my projects. I have tried to link the original source wherever I can.
- Some of the commands may be considered very basic. That is exactly why I have included them in the list - they get used a lot!
- The snippets in this post are probably not the best ways to achieve what is needed, but they work for me and so I have shared them here.
- Importing the Pandas library:
import pandas as pd
- Reading Files:
Pass the full file path to read a csv, xlsx or json file using Pandas.

# reading a csv file
df1 = pd.read_csv("folder1/file_name1.csv")

# reading an excel file
df2 = pd.read_excel("folder2/file_name2.xlsx")

# reading a json file
df3 = pd.read_json("folder3/file_name3.json")
- Creating a DataFrame:
There are a few ways to create a DataFrame. The choice of method depends on the data available at hand. If individual lists of data are available, then the second method is simpler.

# making a DataFrame from a dictionary
rev_vs_year = {'Year': ['2017-2018', '2018-2019', '2019-2020'],
               'Total Revenue': [2207852, 2677619, 3454255]}
rev_vs_year = pd.DataFrame(rev_vs_year)

# making a DataFrame from individual lists - another method
Years = ['2017-2018', '2018-2019', '2019-2020']
yearly_rev = [2207852, 2677619, 3454255]
column_names = ['Years', 'Total Revenue']
list_rev_vs_year = list(zip(Years, yearly_rev))
rev_vs_year = pd.DataFrame(list_rev_vs_year, columns=column_names)
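As a quick sanity check, printing the result of the second method gives something like this (the output in the comments follows from the lists above):

print(rev_vs_year)
#        Years  Total Revenue
# 0  2017-2018        2207852
# 1  2018-2019        2677619
# 2  2019-2020        3454255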
- Getting an overview of the DataFrame:
To look at the first/last five rows of a DataFrame:

df.head() # to look at the first five rows
df.tail() # to look at the last five rows

When there are many columns, df.head() cuts them off and doesn't show the full header of the DataFrame. In such cases, it might be useful to look at the transpose of the first five rows to have a glance at all the columns. This command won't transpose the DataFrame, and is just for viewing purposes.

df.head().T # to look at first five rows - transposed
df.tail().T # to look at last five rows - transposed

To look at a sequence of all the columns in the DataFrame:

df.columns

If a list of all the columns is needed, then the above command can be modified.

df.columns.tolist()

To get an array of all the unique values in a given column (either form works; the attribute form only works when the column name has no spaces):

df.Column_Name.unique()
df["Column Name"].unique()

Getting the number of unique values in a column:

df.Column_Name.unique().shape
df["Column Name"].unique().shape

Looking at the DataFrame info:

df.info()

Generating descriptive statistics for the DataFrame:

df.describe()

Counting the occurrences of each unique value in a column:

df["Column"].value_counts()
- Renaming columns in a DataFrame:
Sometimes the imported file has column names that do not work with the flow of the project. In such cases, the columns can be renamed with the help of the rename command. If the column names are final, then "inplace = True" is used to make the change permanent.

df.rename(columns={'Before1': 'After1', 'Before2': 'After2'}, inplace=True)

If a column does not have a name and instead has an index (it can happen sometimes, for example while transposing multiple rows), it can still be renamed. The index of the column can be used (without the quotes) in the rename command.

df.rename(columns={0: 'Total number of Customers'}, inplace=True)
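A quick round trip on a hypothetical two-column DataFrame confirms the effect:

df = pd.DataFrame({'Before1': [1], 'Before2': [2]})
df.rename(columns={'Before1': 'After1', 'Before2': 'After2'}, inplace=True)
print(df.columns.tolist())  # ['After1', 'After2']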
- Renaming a cell value in a DataFrame:
df = df.replace({'Column' : 'Old Value'}, 'New Value')
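For example, on a hypothetical 'Country' column (the names are made up for illustration):

df = pd.DataFrame({'Country': ['Eire', 'India']})
df = df.replace({'Country': 'Eire'}, 'Ireland')
print(df['Country'].tolist())  # ['Ireland', 'India']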
- Indexing:
If a DataFrame does not have a column for row-indices, one can be added using the simple command:

df.reset_index(inplace=True)

On the other hand, if the DataFrame already has a column for row-indices but they are not usable (can happen after deleting or rearranging rows), the existing index column can be dropped using "drop = True" and then a new index column can be added.

df.reset_index(drop=True, inplace=True)

Sometimes, an already existing column (or columns) is also used as an index for the DataFrame. Note that set_index returns a new DataFrame, so the result has to be assigned back (or inplace=True passed).

df = df.set_index('column name') # for a single column
df = df.set_index(['column1', 'column2']) # for multiple columns
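A short round trip, assuming a hypothetical 'Year' column:

df = pd.DataFrame({'Year': ['2018', '2019'], 'Revenue': [100, 200]})
df = df.set_index('Year')     # 'Year' becomes the row index
df.reset_index(inplace=True)  # and becomes a regular column again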
- Dropping Columns in a DataFrame:
Often it is the case that a DataFrame has columns that are not needed in the project. In such cases, the columns are dropped from the DataFrame.

df.drop(['Column Name'], axis=1, inplace=True) # using the column name
df.drop(df.columns[3:12], axis=1, inplace=True) # using column positions
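For instance, with hypothetical column names:

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
df.drop(['B'], axis=1, inplace=True)
print(df.columns.tolist())  # ['A', 'C']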
- Dropping Rows in a DataFrame:
Similar to columns, the rows in a DataFrame can also be dropped. Note that drop returns a new DataFrame unless "inplace = True" is passed.

df = df.drop(df.index[[1, 3]]) # drops the rows at positions 1 and 3

However, if the first 'n' rows have to be dropped, using iloc is a much more efficient way.

df = df.iloc[3:] # deletes the first three rows of a DataFrame
df = df.iloc[n:] # drops the first n rows
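A quick check on a hypothetical single-column DataFrame:

df = pd.DataFrame({'A': [10, 11, 12, 13, 14]})
df = df.drop(df.index[[1, 3]])
print(df['A'].tolist())  # [10, 12, 14]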
- Exporting a DataFrame:
# to a csv file
df.to_csv('Name_of_output_file_here.csv', index=False)

# to an excel file
df.to_excel('Name_of_output_file_here.xlsx', index=False)
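Depending on the environment, to_excel may need an Excel writer package such as openpyxl installed; the engine can also be named explicitly:

df.to_excel('Name_of_output_file_here.xlsx', index=False, engine='openpyxl')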
- Checking the percentage of missing values: [SOURCE]
def printNullValues(df):
    # count the missing values per column, largest first
    total = df.isnull().sum().sort_values(ascending=False)
    # keep only the columns that actually have missing values
    total = total[total != 0]
    percent = total * 100 / len(df)
    concat = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    print(concat)
    print("-------------")

printNullValues(name_of_dataframe)
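Running it on a small hypothetical DataFrame shows the output format:

import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3, np.nan], 'B': [1, 2, 3, 4]})
printNullValues(df)
#    Total  Percent
# A      2     50.0
# -------------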
- Counting the null values in a column:
print(df['Column'].isnull().values.sum())
- Counting all the null values in a DataFrame:
print(df.isnull().values.sum())
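A per-column breakdown sits between these two; isnull().sum() without .values gives the null count for every column at once:

print(df.isnull().sum())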
- Replacing missing values:
Using fillna(), the missing values can be replaced by any value.

df['Column'] = df['Column'].fillna(0) # replacing with zero
df['Column'] = df['Column'].fillna(100) # replacing with 100
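A common alternative is filling with a statistic of the column itself, e.g. the mean or the median:

df['Column'] = df['Column'].fillna(df['Column'].mean())   # column mean
df['Column'] = df['Column'].fillna(df['Column'].median()) # column median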
- Converting a Date column to the datetime data-type:
df['Date'] = pd.to_datetime(df['Date'])
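For example, converting a column of date strings (values made up for illustration):

df = pd.DataFrame({'Date': ['2020-04-20', '2020-05-01']})
df['Date'] = pd.to_datetime(df['Date'])
print(df['Date'].dtype)  # datetime64[ns]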
- Extracting Dates:
Extracting the year from a 'Date' column and saving it in a new column named 'Year':

df['Year'] = pd.DatetimeIndex(df['Date']).year

Extracting the month from a 'Date' column and saving it in a new column named 'Month':

df['Month'] = pd.DatetimeIndex(df['Date']).month

Extracting the year-month from a 'Date' column and saving it in a new column named 'YYYYMM':

df['YYYYMM'] = pd.DatetimeIndex(df['Date']).to_period('M')
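The .dt accessor does the same job; a minimal sketch with made-up dates:

df = pd.DataFrame({'Date': pd.to_datetime(['2019-04-20', '2020-01-05'])})
df['Year'] = df['Date'].dt.year              # 2019, 2020
df['Month'] = df['Date'].dt.month            # 4, 1
df['YYYYMM'] = df['Date'].dt.to_period('M')  # 2019-04, 2020-01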
- Replacing missing Date values:
Missing date columns mean that data for that entry was not recorded.df['Date'] = df['Date'].fillna(pd.Timedelta(seconds=0))But this will replace the NaT values with 0/0/0 00:00:00, which is not useful. Replacing those null values with a dummy date is a better option.
dummy_date = pd.to_datetime('2020-4-20') df['Date'] = df['Date'].fillna(dummy_date)
- Sorting the DataFrame:
The DataFrame can be sorted in ascending/descending order by the values of a given column (or columns).

df.sort_values(by=['Column'], inplace=True, ascending=True, axis=0) # ascending order
df.sort_values(by=['Column'], inplace=True, ascending=False, axis=0) # descending order
df.sort_values(by=['Column1', 'Column2'], inplace=True, ascending=True, axis=0) # using multiple columns
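For example, with a hypothetical 'Revenue' column:

df = pd.DataFrame({'Revenue': [300, 100, 200]})
df.sort_values(by=['Revenue'], inplace=True, ascending=False)
print(df['Revenue'].tolist())  # [300, 200, 100]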
- Grouping the DataFrame:
The groupby operation splits the DataFrame object into groups according to the column (or columns) provided and then combines the results according to the method mentioned.

# counts the non-null entries in each group of 'Column'
df.groupby('Column').count()

# computes the mean of each group of 'Column'
df.groupby('Column').mean()

# counts the number of rows in each group of 'Column'
df.groupby('Column').size()
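A small sketch with made-up data, grouping on a hypothetical 'Country' column:

df = pd.DataFrame({'Country': ['Ireland', 'Ireland', 'India'],
                   'Revenue': [100, 200, 300]})
print(df.groupby('Country').mean())
#          Revenue
# Country
# India      300.0
# Ireland    150.0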
- Selecting specific rows using a mask/filter:
# selecting rows whose index label is "row1"
df.loc[["row1"]]

# selecting rows that match values in a specific column
df[df["Column"].isin(["Value1", "Value2", "Value3"])]

# for a single value, a simpler form can also be used
df.loc[df["Column"] == "Value"]

Example:

df[df.Country == 'Ireland']

# selecting rows by imposing a condition on a specific column
df[df["Column"] > value]

Example:

df[df["Revenue"] > 50000]