Import pandas as pd: an essential Python library for data scientists. Once you import it, you can take your data analytics to a whole new level.

As a general-purpose programming language, Python has all the features needed to analyze and gain insights from data. For this reason, interviewers often expect future data scientists to be proficient in Python.

While it has its own advantages, such as readability and flexibility, the standard Python language is insufficient for serious data analysis. Fortunately, Python supports external libraries that extend its basic functionality. **Pandas** is one such library: it extends the language beyond its basic features and provides a toolkit for performing common data science tasks.

## Pandas in Data Science

Pandas is an essential tool for doing data science in Python. It allows you to perform many tasks that would be almost impossible with the standard language: working with data in various formats (tabular or labeled), sorting and formatting it, combining data from multiple sources, finding and cleaning messy data, and even visualizing it.

The Pandas API is easy to follow, especially if you already know SQL. The functions are well documented and descriptively named. Many of the features have the same name as their counterparts in SQL.

If the job you’re applying for requires some knowledge of Python, you’ll have to make extensive use of **pandas** library functions. Interviewers can ask you questions to gauge your ability to use all the features of the **pandas** library.

## What can pandas do?

The short answer is that you need **Pandas** to get access to many functions for performing data analysis in Python. It is especially useful for data modeling, data analysis, and data manipulation.

**Pandas** provides two data structures, **Series** and **DataFrame**, which give you the flexibility to work with datasets in different formats, such as tabular, time series, and matrix data. The **DataFrame** provides the basis for performing essential operations such as aligning data, calculating statistics, segmenting, grouping, merging, adding, or splitting datasets. It is useful for working with two-dimensional data from tables.

For example, the **sort_values()** and **sort_index()** functions are commonly used to sort and organize data. **merge()** allows you to combine and work with data from two tables, similar to JOINs in SQL. You can even specify the shared column and the method for combining data. The **concat()** function joins Series and DataFrames along an axis.
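As a quick illustration of the functions above, here is a minimal sketch using two made-up tables (the table and column names are invented for this example):

```python
import pandas as pd

# Two small, hypothetical tables
employees = pd.DataFrame({"emp_id": [2, 1, 3], "dept_id": [10, 10, 20]})
depts = pd.DataFrame({"dept_id": [10, 20], "dept_name": ["Sales", "IT"]})

# sort_values() orders rows by one or more columns
employees_sorted = employees.sort_values(by="emp_id")

# merge() combines the two tables on their shared column, like a SQL JOIN
joined = employees.merge(depts, on="dept_id", how="inner")

# concat() stacks DataFrames on top of each other (axis=0 by default)
stacked = pd.concat([employees, employees])
```

Here `merge()` produces one row per matching `dept_id` pair, and `concat()` simply appends the second frame below the first.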

The **read_csv()** and **read_excel()** functions allow you to work with data from CSV files and Excel spreadsheets. Calling the **df.head()** function returns the first five rows by default. You can provide an argument to specify the number of rows you want it to return.

The functions of the **pandas** library can be used to obtain information about the available data. **df.shape** returns the number of rows and columns, and **df.size** returns the number of rows multiplied by the number of columns (both are attributes, not methods). **df.info()** gives us information about the value types of each column.
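A short sketch of these inspection tools on a small, made-up DataFrame:

```python
import pandas as pd

# A small, hypothetical DataFrame to inspect
df = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})

print(df.head())    # first five rows by default
print(df.head(3))   # first three rows
print(df.shape)     # (rows, columns); an attribute, not a method
print(df.size)      # rows * columns; also an attribute
df.info()           # column names, dtypes, and non-null counts
```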

Functions and accessors such as **pd.to_datetime()**, **dt.to_period()**, **dt.year**, **dt.month**, and the like are very useful for working with datetime values.
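For instance, a minimal sketch (with invented dates) of converting strings to datetimes and pulling out parts of each date:

```python
import pandas as pd

# Hypothetical order dates stored as plain strings
orders = pd.DataFrame({"created_at": ["2023-01-15", "2023-02-20", "2023-02-28"]})

# pd.to_datetime() converts the strings to datetime values
orders["created_at"] = pd.to_datetime(orders["created_at"])

# The .dt accessor exposes parts of each date
orders["year"] = orders["created_at"].dt.year
orders["month"] = orders["created_at"].dt.month
```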

The library also contains features for cleaning up messy data. You can configure conditions, limits, and formatting rules to detect outliers. You can also define how to correct erroneous data. Doing this will help you eliminate unrealistic values, which can improve the accuracy of your data analysis.
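One possible sketch of this idea, using made-up prices: a boolean condition detects implausible values, and **clip()** is one way to cap them at a limit.

```python
import pandas as pd

# Made-up prices with one unrealistic outlier
df = pd.DataFrame({"price": [19.99, 24.50, 9999.0, 15.75]})

# Detect values outside a plausible range
outliers = df[df["price"] > 1000]

# One way to correct them: cap values at an upper limit with clip()
df["price_clean"] = df["price"].clip(upper=100.0)
```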

## How to import pandas as pd in Python

Installing Python alone is not enough to import the **Pandas** library; you need to install the library itself. Pandas supports Python version 2.7 or higher.

Newer versions of Python (3.4 and higher) include **pip**, a tool for installing packages such as the **Pandas** library. If you are running an older version of Python, you can install **pip** manually or download the latest version of Python from https://www.python.org.

Once you have configured pip, open the command prompt and type the following command:

pip install pandas

After the installation is complete, the command prompt will display a success message, such as:

Successfully installed pandas-0.24.2

You can also download and install Pandas without using the command prompt. Manually download and install the **Anaconda** software, which also automatically installs **Pandas**.

**Note:** During the installation process, it is essential to tick the checkbox for the ‘Add Anaconda to my PATH environment variable’ option.

Once **Pandas** is installed, you can import it as pd, along with its companion library, **numpy**:

import pandas as pd
import numpy as np

We can simply type ‘import pandas’ to make this module available. However, it is good practice to use the **as** keyword to give it the abbreviated name ‘pd’. Still, this is not a requirement.

In this example, we also imported **numpy**, another data science library, which is often used with **pandas**.

## Examples of questions where you need to import pandas as pd

The **pandas** library provides essential features for performing data analysis in Python. For this reason, solutions to Python interview questions are often based on features from the **Pandas** library.

### Question 1 – Find user purchases

Solving this question boils down to working with purchase dates to find cases where a single user ordered twice within a 7-day period. The functions of the **pandas** library will be essential to find the answer.

Link to question: https://platform.stratascratch.com/coding/10322-finding-user-purchases

**Available data**

The first step in finding an answer is to look at the available data. In this case, we only have one **amazon_transactions** table with five columns.

Let’s preview the actual data in the table:

#### Step-by-step code solution

**Step 1: Import pandas as pd in Python**

The **pandas** and **NumPy** libraries provide an essential toolset for performing complex data science tasks. Almost all the functions we use to answer this question come from the **pandas** library.

import pandas as pd
import numpy as np
from datetime import datetime

The datetime library is useful for working with date values.

**Step 2: Convert date values**

First, we need to convert the values in the **created_at** column to the **datetime** type and format them.

For that, we use the **to_datetime()** and **dt.strftime()** functions of the **pandas** library.

**pd.to_datetime()** takes one argument: the DataFrame column (Series) that must be converted to datetime objects. **dt.strftime()** is used to specify the date format.

amazon_transactions['created_at'] = pd.to_datetime(amazon_transactions['created_at']).dt.strftime('%m-%d-%Y')

**Step 3: Arrange dates in ascending order**

In the next step, we need to put the date values (from the **created_at** column) in ascending order. When it comes to dates, ascending order means earliest to latest.

We use another **Pandas** library function, **sort_values()**. It is needed to organize the values in ascending order, based on the values of the **user_id** and **created_at** columns.

df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])

**Step 4: Find the previous purchase date**

In this step, we create a new column to store the closest previous date when the same user placed an order.

We generate values for the new column by moving the values in the **created_at** column one row down. This is necessary to store the previous time the user placed an order. To achieve this, we will use the **shift()** function of the **Pandas** library. It allows us to shift the values of a specific column by index.

We give the new column the descriptive name **prev_value**. For each row, the value in the **prev_value** column will be equal to the **created_at** value in the previous row. For example, the second row’s **prev_value** will hold the **created_at** value from the first row.

df['prev_value'] = df.groupby('user_id')['created_at'].shift()

**Step 5: Find repeat orders in 7 days**

Finally, we find the difference between multiple orders placed by the same user by subtracting the date values in the **created_at** and **prev_value** columns. For that, we once again use the **pd.to_datetime()** function, along with the **dt.days** accessor to find the number of days between the two dates.

The **dt.days** accessor takes a timedelta value and returns the number of days it spans.

df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()

We find all instances where the same user placed orders within 7 days and store them in the **result** variable. We use the **unique()** function so that each qualifying user appears only once; it returns an array of unique values.

Here is the final solution:

import pandas as pd
import numpy as np
from datetime import datetime

amazon_transactions['created_at'] = pd.to_datetime(amazon_transactions['created_at']).dt.strftime('%m-%d-%Y')
df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()

**Output**

Running this Python code will return the unique **user_id** values of users who placed a repeat order within 7 days.

As you can see, in the process of finding the answer, we make extensive use of **Pandas** library functions. The Python code looks a little different from the SQL query, but both follow the same logical pattern.

### Question 2 – Second highest salary

To find the second highest salary, we will need to rank the values in the **salary** column. The **pandas** library provides all the necessary functions to solve this question in Python.

Link to question: https://platform.stratascratch.com/coding/9892-second-highest-salary

**Available data**

It is good practice to study the available data before answering the question. For this question, we are dealing with an **employee** table with fourteen columns.

The **salary** column contains integer (numeric) values describing each employee’s salary.

Now, let’s look at the actual data in the table:

#### Step-by-step code solution

**Step 1: Import pandas as pd in Python**

We need to use **pandas** library functions to work with tabular data. In the following steps, it will become clear why it is so important to import **pandas** and **numpy**.

import pandas as pd
import numpy as np

**Step 2: Get a list of unique salaries**

To start, we use the **drop_duplicates()** function of the Pandas library, which removes rows with duplicate values in a specific column, in this case the **salary** column. This leaves us with a list of unique salary values.

distinct_salary = employee.drop_duplicates(subset='salary')

**Step 3: Rank salaries**

The question asks us to find the second highest salary. Therefore, we will have to use the **rank()** function, which is the Pandas alternative to ranking window functions in SQL. It assigns a numeric value to each row, corresponding to its rank.

We create an ‘rnk’ column to rank the values in the **distinct_salary** dataset. When calling the **rank()** function, we provide two arguments: one specifying the type of ranking to perform (dense) and one for the order of values.

distinct_salary['rnk'] = distinct_salary['salary'].rank(method='dense', ascending=False)

In SQL, we have several ranking window functions. The **pandas** library provides only the **rank()** function, but we can specify the type of ranking through its arguments (method='dense').

Dense ranking follows this principle to deal with ties: if two **salary** values are equal, both receive the same rank, and the next rank is not skipped.
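A minimal sketch of dense ranking, using made-up salaries with a tie at the top:

```python
import pandas as pd

# Made-up salaries with a tie at the top
salaries = pd.Series([90000, 90000, 80000, 70000])

# Dense ranking: tied values share a rank, and no ranks are skipped
dense = salaries.rank(method="dense", ascending=False)
print(dense.tolist())  # [1.0, 1.0, 2.0, 3.0]
```

Note that the two tied salaries both get rank 1, and the next value gets rank 2 rather than 3.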

**Step 4: Find the second highest salary**

Once we rank all salaries, all that’s left to do is return the salary that has a rank value of 2.

result = distinct_salary[distinct_salary.rnk == 2][['salary']]

The final solution will look like this:

import pandas as pd
import numpy as np

distinct_salary = employee.drop_duplicates(subset='salary')
distinct_salary['rnk'] = distinct_salary['salary'].rank(method='dense', ascending=False)
result = distinct_salary[distinct_salary.rnk == 2][['salary']]

**Output**

Running the above code will return only one column with the second highest salary:

### Question 3 – Revenue over time

This is a difficult question with many moving parts. We will need to import **pandas** as **pd** to use functions that work with date values. We also need this library for other essential utilities, such as the **groupby()** function, aggregate functions, and the **rolling()** function, which is very useful for this particular question.

Link to question: https://platform.stratascratch.com/coding/10314-revenue-over-time

**Available data**

Looking at the data can help you better understand the question. In this case, we are dealing with data contained in the **amazon_purchases** table with three columns.

Previewing the table can help you better understand the data:

#### Step-by-step code solution

**Step 1: Import pandas as pd in Python**

As a proficient data scientist in Python, you are expected to know what tools you will need to answer the interview question using the fewest lines of code.

**Pandas** is one of the essential libraries; it provides functions such as **to_datetime()**, **dt.to_period()**, **groupby()**, and **sum()**, which you will need to find the answer.

import pandas as pd
import numpy as np
from datetime import datetime

**Step 2: Set the format for float values**

First, we modify **pd.options** to specify how numbers are displayed when generating a DataFrame.

pd.options.display.float_format = "{:,.2f}".format

In this case, we tell **pandas** to display float values with 2 decimal places and to use a comma separator for readability.

**Step 3: Filter the negative values of the purchase amount**

The question tells us not to take negative **purchase_amt** values, which represent returns, into account. We set a condition to keep rows where the value of the **purchase_amt** column is greater than 0 and store them in the **df** variable.

df = amazon_purchases[amazon_purchases['purchase_amt'] > 0]

**Step 4: Convert datetime values to months**

In the next step, we need to take the **created_at** values in the available data and use the **pd.to_datetime()** and **dt.to_period()** functions to transform them into month values.

df['month_year'] = pd.to_datetime(df['created_at']).dt.to_period('M')

**dt.to_period()** is another **Pandas** library function; it converts datetime values into a period value, such as the year, day, or month of the datetime value.

**Step 5: Get the purchase totals for each month**

We use the **groupby()** function to group purchases by the month in which they occurred, and the **sum()** function to calculate the total of the **purchase_amt** values for each group. Both functions are from the **pandas** library.

**groupby()** works similarly to the GROUP BY statement in SQL. It creates groups out of rows that have the same value in a specific column, in this case the ‘month_year’ column.

The **sum()** function works similarly to the aggregate function of the same name in SQL. It adds up the values in the ‘purchase_amt’ column of each group.

df1 = df.groupby('month_year')['purchase_amt'].sum().reset_index(name='monthly_revenue').sort_values('month_year')
df1.set_index('month_year', inplace=True)

We use the **set_index()** method to set **month_year** as the index column for the df1 data frame. The values in this column become identifiers for each row. Establishing the index is necessary for the accuracy and efficiency of our data analysis.

**Step 6: Calculate the 3-month moving average**

Finally, we use the **rolling()** function from the **Pandas** library to calculate a three-month rolling revenue. This function is used to make rolling window calculations. It takes two arguments: the number of periods in the rolling window and the minimum number of periods.

We create a new variable, **rolling_windows**, to store the three-month windows. We then use the **mean()** function to calculate the three-month moving averages. Finally, we use the **to_records()** function to convert the DataFrame into an array of **NumPy** records.

rolling_windows = df1.rolling(3, min_periods=1)
rolling_mean = rolling_windows.mean()
result = rolling_mean.to_records()

Here is the final answer to this question:

import pandas as pd
import numpy as np
from datetime import datetime

pd.options.display.float_format = "{:,.2f}".format
df = amazon_purchases[amazon_purchases['purchase_amt'] > 0]
df['month_year'] = pd.to_datetime(df['created_at']).dt.to_period('M')
df1 = df.groupby('month_year')['purchase_amt'].sum().reset_index(name='monthly_revenue').sort_values('month_year')
df1.set_index('month_year', inplace=True)
rolling_windows = df1.rolling(3, min_periods=1)
rolling_mean = rolling_windows.mean()
result = rolling_mean.to_records()

**Output**

Running this code will generate a 3-month moving average for each month of the year:

### Question 4 – Top percentile fraud

In this question, we have to find the top 5th percentile of claims for each state. For that, we’ll need to use the **groupby()** and **rank()** functions, similar to the ranking window functions in SQL. The **pandas** library provides these and all the other functions needed to get the answer.

Link to question: https://platform.stratascratch.com/coding/10303-top-percentile-fraud

#### Solution

This is another challenge that can be answered using the **pandas** library in Python. First, we need to use certain **pandas** library functions to generate percentile values for each state and store them in a separate column.

Once we have percentile values in a separate column, we’ll need to filter them. The question tells us to find records with **percentile **values of 0.95 or more. Perhaps it would be a good idea to store the filtered records in a variable.

As a last step, we will need to generate four columns specified in the question description.
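A minimal sketch of the percentile approach described above, using an invented table and invented column names (`claims`, `state`, `fraud_score`); the real schema is on the question page:

```python
import pandas as pd

# Hypothetical data; the real table and column names come from the question page
claims = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NY", "NY"],
    "fraud_score": [0.1, 0.5, 0.99, 0.2, 0.97],
})

# rank(pct=True) turns each score into a percentile within its state
claims["percentile"] = claims.groupby("state")["fraud_score"].rank(pct=True)

# Keep only claims at or above the 95th percentile
result = claims[claims["percentile"] >= 0.95]
```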

Try to answer the question yourself below:

### Question 5 – Total cost of orders

The premise of the question is quite simple. We have to add up the total cost of orders placed by customers. We will need **pandas** library functions to do the aggregation and the other tasks required to get the answer.

Link to question: https://platform.stratascratch.com/coding/10183-total-cost-of-orders

#### Solution

For this question, we have two DataFrames (tables): **customers **and **orders**. None of these tables alone contains enough information to find the answer.

The **pandas **library comes with a very useful **merge() **function, which has similar functionality to JOINs in SQL. For arguments, we provide the names of each table and columns that contain overlapping values.

The question asks us to calculate the total cost of orders for each user. Most likely, there are several orders placed by each user. In this case, you will need to use **groupby()** and **sum()** to add up the total cost of orders for each user.

Finally, you’ll need a certain **pandas** library function to arrange the rows in ascending alphabetical order of names. And we must not forget to generate the columns specified in the description.
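A minimal sketch of the merge-then-aggregate approach, using invented tables and column names (`customers`, `orders`, `cust_id`, `total_order_cost`); the real schemas come from the question page:

```python
import pandas as pd

# Hypothetical tables; the real schemas come from the question page
customers = pd.DataFrame({"id": [1, 2], "first_name": ["Zoe", "Adam"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "total_order_cost": [50, 30, 20]})

# merge() joins the two tables on their shared key, like a SQL JOIN
merged = customers.merge(orders, left_on="id", right_on="cust_id")

# groupby() + sum() aggregate the order costs per customer,
# and sort_values() arranges rows alphabetically by name
result = (
    merged.groupby("first_name")["total_order_cost"]
    .sum()
    .reset_index()
    .sort_values("first_name")
)
```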

Try to get the answer yourself:

Check out our post “Python Pandas Interview Questions” to practice more questions like these.

## Conclusion

In this article, we explained how to import pandas as pd in Python and its importance for doing data science. Mastering its most essential features can bring you one step closer to getting a job, and accelerate your career once you get it.

If you haven’t mastered **pandas** and Python yet, the StrataScratch platform is a great place to start.

It has dozens of practical interview questions that require you to import and use functions from the **pandas** library. The platform allows you to view the available data, see hints if you’re stuck, check the output of your code as you write it, and use other features for interactive learning.

In short, StrataScratch has everything you need to hone your **pandas** and Python skills. And as you may have heard, practice makes perfect, especially when it comes to technical skills like doing data science in Python.