Import pandas as pd: an essential Python library for data scientists. Once you import it, you can take your data analytics to a whole new level.
As a general-purpose programming language, Python has all the features needed to analyze and gain insights from data. For this reason, interviewers often expect future data scientists to be proficient in Python.
While it has its own advantages, such as readability and flexibility, standard Python alone is insufficient for serious data analysis. Fortunately, Python supports external libraries that extend its basic functionality. Pandas is one such library: it extends the language beyond its basic features and provides a toolkit for performing common data science tasks.
Pandas in Data Science
Pandas is an essential tool for doing data science in Python. It allows you to perform many tasks that would be almost impossible with the standard language: working with data in various formats (tabular or labeled), sorting and formatting it, combining data from multiple sources, finding and cleaning messy data, and even visualizing it.
The Pandas API is easy to follow, especially if you already know SQL. The functions are well documented and descriptively named. Many of the features have the same name as their counterparts in SQL.
If the job you’re applying for requires some knowledge of Python, you’ll have to make heavy use of pandas library functions. Interviewers may ask you questions to gauge your ability to use the full feature set of the pandas library.
What can pandas do?
The short answer is that you need Pandas to get access to many functions to perform data analysis in Python. It is especially useful for doing data modeling, data analysis, and data manipulation.
Pandas provides two data structures, Series and DataFrame, which give you the flexibility to work with datasets in different formats, such as tabular, time series, and matrix data. The DataFrame provides the basis for performing essential operations such as aligning data, calculating statistics, segmenting, grouping, merging, adding, or splitting datasets. It represents two-dimensional, table-like data with labeled rows and columns.
For example, the sort_values() and sort_index() functions are commonly used to sort and organize data. merge() allows you to combine data from two tables, similar to JOINs in SQL; you can even specify the shared column and the method for combining data. The concat() function stacks multiple Series or DataFrames along a shared axis.
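To make the comparison to SQL concrete, here is a minimal sketch of merge(), sort_values(), and concat() on two made-up tables (the users/orders names and columns are hypothetical, purely for illustration):

```python
import pandas as pd

# Two small, made-up tables
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ana", "Ben", "Cruz"]})
orders = pd.DataFrame({"user_id": [2, 1, 2], "amount": [30, 10, 20]})

# merge() behaves like an SQL JOIN on the shared user_id column
joined = users.merge(orders, on="user_id", how="inner")

# sort_values() orders the joined rows, here by amount
joined = joined.sort_values(by="amount")  # amounts: 10, 20, 30

# concat() stacks DataFrames with the same columns on top of each other
stacked = pd.concat([orders, orders], ignore_index=True)  # 6 rows
```

Note that the user with no orders (user_id 3) drops out of the inner join, just as it would in SQL.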
The read_csv() and read_excel() functions let you load data from CSV files and Excel spreadsheets. Calling the df.head() function returns the first five rows by default; you can provide an argument to specify the number of rows you want it to return.
The functions and attributes of the pandas library can be used to get an overview of the available data. df.shape (an attribute, not a function) returns the number of rows and columns. df.size returns the number of rows multiplied by the number of columns. df.info() prints the value types for each column.
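As a quick sketch of these inspection tools on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

rows, cols = df.shape  # shape is an attribute: (3 rows, 2 columns)
n_cells = df.size      # size = rows * columns = 6
df.info()              # prints column names, dtypes, and non-null counts
```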
Functions such as pd.to_datetime() and dt.to_period(), together with accessors such as dt.year and dt.month, are very useful for working with datetime values.
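Here is a minimal sketch of these datetime tools in action (the date strings are made up):

```python
import pandas as pd

s = pd.Series(["2023-01-15", "2023-06-30"])
dates = pd.to_datetime(s)          # parse strings into datetime64 values

years = dates.dt.year              # dt.year / dt.month are accessors, not methods
months = dates.dt.month
periods = dates.dt.to_period("M")  # collapse each date to its calendar month
```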
The library also contains features for cleaning up messy data. You can configure conditions, limits, and formatting rules to detect outliers. You can also define how to correct erroneous data. Doing this will help you eliminate unrealistic values, which can improve the accuracy of your data analysis.
How to import pandas as pd in Python
Installing Python alone is not enough to import the Pandas library; you need to install the library itself. While older releases supported Python 2.7, current pandas releases require Python 3.
Newer versions of Python (3.4 and higher) include pip, a tool for installing packages such as the Pandas library. If you are running an older version of Python, you can install pip manually or download the latest version of Python from https://www.python.org.
Once you have configured pip, open the command prompt and type the following command:

pip install pandas

After the installation is complete, the command prompt will display a success message, such as:

Successfully installed pandas-0.24.2
You can also download and install Pandas without using the command prompt. Manually download and install the Anaconda distribution, which installs Pandas automatically.
Note: During the installation process, it is essential to tick the checkbox for the ‘Add Anaconda to my PATH environment variable’ option.
Once Pandas is installed, you can import pandas as pd along with its companion library, numpy:

import pandas as pd
import numpy as np
We could simply type ‘import pandas’ to make this module available. However, it is good practice to use the as keyword to give it the abbreviated name ‘pd’; this is a convention, not a requirement.
In this example, we also imported numpy, another data science library, which is often used with pandas.
Examples of problems where you need to import pandas as pd
The pandas library provides essential features for performing data analysis in Python. For this reason, solutions to Python interview questions are often based on features from the pandas library.
Question 1 – Find user purchases
Solving this question boils down to working with purchase dates to find cases where a single user ordered twice within a 7-day period. The functions of the pandas library will be essential to find an answer.
Link to question: https://platform.stratascratch.com/coding/10322-finding-user-purchases
Available data
The first step in finding an answer is to look at the available data. In this case, we only have one amazon_transactions table with five columns.
Let’s preview the actual data in the table:
Step-by-step code solution
Step 1: Import pandas as pd in Python
The pandas and NumPy libraries provide an essential toolset for performing complex data science tasks. Almost all the functions we use to answer this question come from the pandas library.
import pandas as pd
import numpy as np
from datetime import datetime
The datetime library is useful for working with date values.
Step 2: Convert date values

First, we need to convert the values in the created_at column to the datetime type and format them.
For that, we use the to_datetime() and dt.strftime() functions of the pandas library.
pd.to_datetime() takes an argument: the DataFrame column that must be converted to datetime objects. dt.strftime() is used to specify the date format.
amazon_transactions["created_at"] = pd.to_datetime(amazon_transactions["created_at"]).dt.strftime('%m-%d-%Y')
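As a self-contained sketch of what this conversion does, using a made-up two-row table in place of the real amazon_transactions data:

```python
import pandas as pd

# Hypothetical mini version of the transactions table
tx = pd.DataFrame({"created_at": ["2020-03-08 12:00:00", "2020-03-10 09:30:00"]})

# Parse the strings into datetimes, then re-format them as month-day-year strings
tx["created_at"] = pd.to_datetime(tx["created_at"]).dt.strftime("%m-%d-%Y")
# → "03-08-2020", "03-10-2020"
```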
Step 3: Arrange dates in ascending order
In the next step, we need to put the date values (from the created_at column) in ascending order. When it comes to dates, ascending order means from earliest to latest.
We use another function of the pandas library, sort_values(), to organize the values in ascending order based on the user_id and created_at columns.
df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])
Step 4: Find the previous purchase date
In this step, we create a new column to store the closest previous date when the same user placed an order.
We generate values for the new column by moving the values in the created_at column one row down within each user’s group. This stores the previous time the user placed an order. To achieve this, we use the shift() function of the Pandas library, which shifts the values of a column by a given number of rows.
We give the new column a descriptive name, prev_value. For each row, the value in prev_value equals the created_at value of the same user’s previous row. For example, the second row for a user holds the created_at value from that user’s first row.
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
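Here is a toy illustration of groupby() + shift() on made-up data, showing how each row picks up the same user’s previous date (and NaN for each user’s first order):

```python
import pandas as pd

# Made-up orders for two users, already sorted by user and date
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "created_at": ["01-01", "01-05", "01-02", "01-03", "01-09"],
})

# shift() moves each user's dates down one row within the group, so every
# row also carries that user's previous order date
df["prev_value"] = df.groupby("user_id")["created_at"].shift()
# prev_value: NaN, "01-01", NaN, "01-02", "01-03"
```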
Step 5: Find repeat orders in 7 days
Finally, we find the difference between consecutive orders placed by the same user by subtracting the date values in the created_at and prev_value columns. For that, we once again use the pd.to_datetime() function, together with the dt.days accessor, which takes a timedelta value and returns the number of days.
df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()
We find all instances where the same user placed orders within 7 days and store them in the result variable. We use the unique() function so that each user appears only once in the result, no matter how many repeat orders they placed; unique() returns an array of the distinct values in the column.
Here is the final solution:
import pandas as pd
import numpy as np
from datetime import datetime

amazon_transactions["created_at"] = pd.to_datetime(amazon_transactions["created_at"]).dt.strftime('%m-%d-%Y')
df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()
Output
Running this Python code will return the unique identification values of users who placed repeat orders within 7 days.
As you can see, in the process of finding an answer, we make extensive use of the functions of the Pandas library. The Python code looks a little different from an SQL query, but both follow the same logical pattern.
Question 2 – Second highest salary
To find the second highest salary, we will need to rank the values in the salary column. The pandas library provides all the necessary functions to solve this question in Python.
Link to question: https://platform.stratascratch.com/coding/9892-second-highest-salary
Available data

It is good practice to study the available data before answering the question. For this question, we are dealing with an employee table with fourteen columns.
The salary column contains integer (numeric) values to describe each employee’s salary.
Now, let’s look at the actual data in the table:
Step-by-step code solution
Step 1: Import pandas as pd in Python
We need to use pandas library functions to work with tabular data. In the following steps, it will become clear why it is so important to import pandas and numpy.
import pandas as pd
import numpy as np
Step 2: Get a list of unique salaries
To start, we use the drop_duplicates() function of the Pandas library, which removes rows with duplicate values in a specific column. In this case, the salary column. This leaves us with the list of unique salary values.
distinct_salary = employee.drop_duplicates(subset='salary')
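A minimal sketch of what drop_duplicates() does, on a made-up three-row employee table (names and salaries are invented):

```python
import pandas as pd

# Toy employee table with a duplicated salary
employee_demo = pd.DataFrame({
    "name": ["A", "B", "C"],
    "salary": [50000, 60000, 50000],
})

# drop_duplicates keeps the first row for each distinct salary
distinct = employee_demo.drop_duplicates(subset="salary")  # 2 rows remain
```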
Step 3: Rank salaries
The question asks us to find the second highest salary. Therefore, we will have to use the rank() function, which is the pandas alternative to ranking window functions in SQL. It assigns a numeric value to each row, corresponding to its rank.
We create an ‘rnk’ column to rank the values in the distinct_salary dataset. When calling the rank() function, we provide two arguments: one specifying the type of ranking to perform (dense) and one specifying the order of values.
distinct_salary['rnk'] = distinct_salary['salary'].rank(method='dense', ascending=False)
In SQL, we have several ranking window functions. The pandas library provides only the rank() function, but we can specify the type of ranking through arguments (method='dense').
Dense ranking follows this principle to handle ties: if two salary values are equal, both receive the same rank, and the next rank is not skipped.
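A quick sketch of dense ranking on a made-up list of salaries, showing how ties share a rank without creating gaps:

```python
import pandas as pd

salaries = pd.Series([90000, 80000, 90000, 70000])

# method='dense': tied values share a rank and the next rank is not skipped
rnk = salaries.rank(method="dense", ascending=False)
# → 1.0, 2.0, 1.0, 3.0  (both 90000s get rank 1; 80000 still gets rank 2)
```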
Step 4: Find the second highest salary

Once we rank all salaries, all that’s left to do is return the salary that has a rank value of 2.

result = distinct_salary[distinct_salary.rnk == 2][['salary']]
The final solution will look like this:

import pandas as pd
import numpy as np

distinct_salary = employee.drop_duplicates(subset='salary')
distinct_salary['rnk'] = distinct_salary['salary'].rank(method='dense', ascending=False)
result = distinct_salary[distinct_salary.rnk == 2][['salary']]
Output
Running the above code will return only one column with the second highest salary:
Question 3 – Revenue over time

This is a difficult question with many moving parts. We will need to import pandas as pd to use its functions for working with date values. We also need the library for other essential utilities, such as the groupby() function, aggregate functions, and the rolling() function, which is very useful for this particular question.
Link to question: https://platform.stratascratch.com/coding/10314-revenue-over-time
Available
data
Looking at the data can help you better understand the question. In this case, we are dealing with data contained in table amazon_purchases with three columns.
Previewing the table can help you better understand the data:

Step-by-step code solution
Step 1: Import pandas as pd
in Python
As a proficient data scientist in Python, you are expected to know what tools you will need to answer the interview question using the fewest lines of code.
Pandas is one of the essential libraries, providing functions such as to_datetime(), to_period(), groupby(), and sum(), which you will need to find the answer.
import pandas as pd
import numpy as np
from datetime import datetime
Step 2: Set the format for float values
First, we modify pd.options to specify how numbers are displayed when rendering a DataFrame.

pd.options.display.float_format = "{:,.2f}".format
In this case, we tell pandas to render float values with 2 decimal places and a comma as the thousands separator for readability.
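A minimal sketch of the effect of this display option (the revenue figure is made up; note the option is global, so it can be worth resetting it afterwards):

```python
import pandas as pd

pd.options.display.float_format = "{:,.2f}".format

df = pd.DataFrame({"revenue": [1234567.891]})
shown = repr(df)  # rendering now uses the configured format: 1,234,567.89

pd.reset_option("display.float_format")  # restore the default afterwards
```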
Step 3: Filter the negative values of the purchase amount
The question tells us not to take into account negative purchase_amt values, which represent returns. We set the condition to store rows where the value of the purchase_amt column is greater than 0. We store them in the variable df.
df = amazon_purchases[amazon_purchases['purchase_amt'] > 0]
Step 4: Convert datetime values to months
In the next step, we take the created_at values in the available data and use the pd.to_datetime() function and the dt.to_period() accessor to transform them into month values.
df['month_year'] = pd.to_datetime(df['created_at']).dt.to_period('M')
dt.to_period() is another pandas library function; it converts datetime values to a period value, such as the year, day, or month of the datetime value.
Step 5: Get the purchase totals for each month

We use the groupby() function to aggregate purchases by the month in which they occurred, and the sum() function to calculate the total purchase_amt for each group. Both functions are from the pandas library.
groupby() works similarly to the GROUP BY statement in SQL. It groups together rows that have the same value in a specific column, in this case the ‘month_year’ column.
The sum() function works similarly to the aggregate function of the same name in SQL. It adds up the values in the ‘purchase_amt’ column of each group.
df1 = df.groupby('month_year')['purchase_amt'].sum().reset_index(name='monthly_revenue').sort_values('month_year')
df1.set_index('month_year', inplace=True)
We use the set_index() method to set month_year as the index column for the df1 data frame. The values in this column become identifiers for each row. Establishing the index is necessary for the accuracy and efficiency of our data analysis.
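To see the groupby/sum/set_index combination in isolation, here is a sketch on a tiny made-up table (the month labels and amounts are invented):

```python
import pandas as pd

# Hypothetical purchases with a precomputed month_year label
df = pd.DataFrame({
    "month_year": ["2020-01", "2020-01", "2020-02"],
    "purchase_amt": [100, 50, 70],
})

# groupby() + sum() mirror SQL's GROUP BY with SUM(); reset_index(name=...)
# turns the grouped Series back into a DataFrame with a named total column
df1 = df.groupby("month_year")["purchase_amt"].sum().reset_index(name="monthly_revenue")
df1.set_index("month_year", inplace=True)
# monthly_revenue: 2020-01 → 150, 2020-02 → 70
```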
Step 6: Calculate the 3-month moving average
Finally, we use the rolling() function from the Pandas library to calculate a three-month rolling revenue. This is the function used to make rolling window calculations. It takes two arguments: the window size (number of periods) and the minimum number of periods.
We create a new variable, rolling_windows, to store the three-month windows. We use the mean() function to calculate the three-month moving averages. Finally, we use the to_records() function to convert the DataFrame into an array of NumPy records.
rolling_windows = df1.rolling(3, min_periods=1)
rolling_mean = rolling_windows.mean()
result = rolling_mean.to_records()
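A minimal sketch of what rolling(3, min_periods=1).mean() produces on a made-up revenue series; min_periods=1 lets the first windows average over however many values exist so far:

```python
import pandas as pd

revenue = pd.Series([100.0, 200.0, 300.0, 400.0])

# 3-period rolling mean; the first two windows are shorter than 3
rolling_mean = revenue.rolling(3, min_periods=1).mean()
# → 100.0, 150.0, 200.0, 300.0
```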
Here is the final answer to this question:

import pandas as pd
import numpy as np
from datetime import datetime

pd.options.display.float_format = "{:,.2f}".format
df = amazon_purchases[amazon_purchases['purchase_amt'] > 0]
df['month_year'] = pd.to_datetime(df['created_at']).dt.to_period('M')
df1 = df.groupby('month_year')['purchase_amt'].sum().reset_index(name='monthly_revenue').sort_values('month_year')
df1.set_index('month_year', inplace=True)
rolling_windows = df1.rolling(3, min_periods=1)
rolling_mean = rolling_windows.mean()
result = rolling_mean.to_records()
Output
Running this code will generate a 3-month moving average for each month of the year:
Question 4 – Top percentile fraud

In this question, we have to find the claims in the top 5th percentile for each state. For that, we’ll need to use the groupby() and rank() functions, similar to the ranking window functions in SQL. The pandas library provides these and all the other functions needed to get the answer.
Link to question: https://platform.stratascratch.com/coding/10303-top-percentile-fraud

Solution
This is another challenge that can be answered using the pandas library in Python. First, we need to use certain pandas library functions to generate percentile values for each state and store them in a separate column.
Once we have percentile values in a separate column, we’ll need to filter them. The question tells us to find records with percentile values of 0.95 or more. Perhaps it would be a good idea to store the filtered records in a variable.
As a last step, we will need to output the four columns specified in the question description.
Try to answer the question yourself below:
Question 5 – Total cost of orders

The premise of the question is quite simple. We have to add up the total cost of orders placed by customers. We will need pandas library functions to do the aggregation and other tasks to get the answer.
Link to question: https://platform.stratascratch.com/coding/10183-total-cost-of-orders
Solution
For this question, we have two DataFrames (tables): customers and orders. None of these tables alone contains enough information to find the answer.
The pandas library comes with a very useful merge() function, which has similar functionality to JOINs in SQL. For arguments, we provide the names of each table and columns that contain overlapping values.
The question asks us to calculate the total cost of orders for each user. Most likely, there are several orders placed by a user. In this case, you will need to use groupby() and sum() to add the total cost of orders for each user.
Finally, you’ll need a certain pandas library function to arrange the rows in ascending alphabetical order of names. And we must not forget to output the columns specified in the description.
Try to get the answer yourself:
Check out our post “Python Pandas Interview Questions” to practice more questions like these.
Conclusion
In this article, we explained how to import pandas as pd in Python and why it is important for doing data science. Mastering its most essential features can bring you one step closer to getting a job, and to accelerating your career once you get it.
If you haven’t mastered pandas and Python yet, the StrataScratch platform is a great place to start.
It has dozens of practical interview questions that require you to import and use functions from the pandas library. The platform allows you to view the available data, see hints if you’re stuck, check the output of your code as you write it, and use other interactive learning features.
In short, StrataScratch has everything you need to hone your pandas and Python skills. And as you may have heard, practice makes perfect, especially when it comes to technical skills like doing data science in Python.