Python Libraries for Data Analysis: NumPy, Pandas, and Matplotlib

Published On: April 21, 2023

Introduction:

Data analysis is a crucial step in the process of making informed decisions based on data. It involves the transformation of raw data into a format that is suitable for analysis. Python has emerged as one of the most popular languages for data analysis due to its powerful libraries and easy-to-learn syntax. In this article, we will focus on three of the most popular Python libraries for data analysis: NumPy, Pandas, and Matplotlib.

NumPy

NumPy (Numerical Python) is an open-source library used in practically every discipline of science and engineering, and it is the de facto standard for handling numerical data in Python. Its users range from novice programmers to experienced researchers working on cutting-edge scientific and industrial research and development. Most other Python data science and scientific packages, including Pandas, SciPy, Matplotlib, scikit-learn, and scikit-image, make significant use of the NumPy API.

Creating NumPy Arrays

NumPy’s core object is the ndarray (n-dimensional array), a homogeneous container for numerical data. Because every element has the same type, NumPy can store the values in a compact block of memory and operate on them without per-element type checks, which makes computations on large datasets faster and more memory-efficient than with ordinary Python lists.
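As a small illustration of the homogeneity described above, the sketch below shows that an array carries a single dtype and that arithmetic applies to every element at once. The variable names are only illustrative.

```python
import numpy as np

# A list of integers becomes an array with a single integer dtype
values = np.array([1, 2, 3, 4, 5])
print(values.dtype)   # an integer type such as int64 (platform dependent)

# Mixing integers and floats upcasts every element to float64
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)    # float64

# Arithmetic is applied element by element, with no explicit loop
doubled = values * 2
print(doubled)        # [ 2  4  6  8 10]
```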

Here’s an example of creating a simple 1-dimensional Numpy array:

import numpy as np

a = np.array([1, 2, 3, 4, 5])

print(a)

Output:

[1 2 3 4 5]

In this example, we’ve created a 1-dimensional NumPy array of integers using the np.array() function. We passed in a list of integers, and NumPy created an array containing the same elements.

We can also create multi-dimensional arrays using the np.array() function. Here’s an example of creating a 2-dimensional NumPy array:

b = np.array([[1, 2, 3], [4, 5, 6]])

print(b)

Output:

[[1 2 3]
 [4 5 6]]

In this example, we’ve created a 2-dimensional NumPy array of integers by passing a list of lists to np.array(). Each inner list becomes one row, so the result has 2 rows and 3 columns.

NumPy also provides many functions for creating arrays with specific properties. For example, the np.zeros() and np.ones() functions create arrays filled with zeros or ones, respectively:

c = np.zeros((3, 4))

print(c)

Output:

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

In this example, we’ve created a 2-dimensional array of zeros with 3 rows and 4 columns using the np.zeros() function.
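The np.ones() function mentioned above works the same way. A minimal sketch follows. The names d and e are chosen here for illustration, and the dtype argument is included because both functions produce floating-point values by default.

```python
import numpy as np

# np.ones() takes the same shape argument as np.zeros()
d = np.ones((2, 3))
print(d)
# [[1. 1. 1.]
#  [1. 1. 1.]]

# An integer array can be requested with the dtype parameter
e = np.zeros((2, 2), dtype=int)
print(e)
# [[0 0]
#  [0 0]]
```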

Indexing NumPy Arrays

We can access individual elements of a NumPy array using indexing. Here’s an example of accessing the second element of the a array:

print(a[1])

Output:

2

In this example, we’ve accessed the second element of the a array using indexing. Note that indexing in NumPy arrays starts from 0, so index 1 refers to the second element.
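Indexing also works from the end of an array and across several dimensions. The sketch below redefines the a and b arrays from the earlier examples so that it runs on its own.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([[1, 2, 3], [4, 5, 6]])

print(a[-1])     # 5, negative indices count from the end
print(b[1, 2])   # 6, row index 1 and column index 2
print(b[0])      # [1 2 3], a single index selects a whole row
```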

Slicing NumPy Arrays

We can also use slicing to access a subset of the elements of a NumPy array. Here’s an example of accessing the first three elements of the a array:

print(a[:3])

Output:

[1 2 3]

In this example, we’ve used slicing to access the first three elements of the a array. The slice a[:3] is shorthand for a[0:3]: the start index is inclusive and the stop index is exclusive, so it selects the elements at indices 0, 1, and 2.
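Slices can also take an explicit start, a step, and one slice per dimension. A short sketch, reusing the a and b arrays from the earlier examples:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([[1, 2, 3], [4, 5, 6]])

middle = a[1:4]        # indices 1, 2, and 3
every_other = a[::2]   # start to end in steps of 2
second_column = b[:, 1]  # all rows, column index 1

print(middle)          # [2 3 4]
print(every_other)     # [1 3 5]
print(second_column)   # [2 5]
```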

Reshaping NumPy Arrays

We can reshape NumPy arrays using the np.reshape() function. Here’s an example of reshaping the a array into a 2-dimensional array with 5 rows and 1 column:

a_reshaped = np.reshape(a, (5, 1))

print(a_reshaped)

Output:

[[1]
 [2]
 [3]
 [4]
 [5]]

In this example, we’ve used the np.reshape() function to reshape the a array into a 2-dimensional array with 5 rows and 1 column.
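Arrays also have a reshape() method, and one dimension may be given as -1 so that NumPy infers it from the number of elements. The sketch below illustrates both. The names column and grid are chosen here for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# -1 asks NumPy to work out the number of rows from the array size
column = a.reshape(-1, 1)
print(column.shape)   # (5, 1)

# The new shape must hold exactly the same number of elements
grid = np.arange(6).reshape(2, 3)
print(grid)
# [[0 1 2]
#  [3 4 5]]
```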

Pandas:

Pandas is the second of the three libraries covered in this article. One of the most widely used Python libraries for data analysis, it was created by Wes McKinney in 2008 in response to the need for a powerful and flexible tool for quantitative analysis. It has a very active contributor community.

Pandas is built on top of NumPy, which it uses for numerical operations, and it integrates with Matplotlib for data visualisation. As a result, Pandas lets you reach much of the functionality of both libraries with fewer lines of code. For instance, the .plot() method in Pandas wraps several Matplotlib calls, so you can draw a chart in a single line.
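To make the .plot() point concrete, here is a minimal sketch. The sales data, column names, and output file name are hypothetical, and the Agg backend is selected only so that the script runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window is opened
import pandas as pd

# Hypothetical data used only for this illustration
sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "units": [120, 150, 90]})

# One call draws a bar chart; Pandas creates the Matplotlib figure and axes
ax = sales.plot(x="month", y="units", kind="bar", legend=False)

# Save the chart to a temporary directory (hypothetical file name)
out_path = os.path.join(tempfile.gettempdir(), "units_by_month.png")
ax.figure.savefig(out_path)
```

The returned ax object is an ordinary Matplotlib Axes, so it can be customised further with Matplotlib methods.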

Pandas is a prominent open-source data manipulation library that is widely used in data analysis and data science projects. It offers user-friendly data structures and data analysis tools that let users manage and work with large data sets effectively. The following sections go over some of the most important Pandas features, with programming examples.

Data Structures

The two main data structures offered by Pandas are Series and DataFrame.

A Series is a one-dimensional, labelled array that can hold data of any type. It is similar to a single column in a spreadsheet or a SQL table. A Series can be created by passing a list or array of values to the Series constructor, as demonstrated below:

import pandas as pd

# Creating a Series from a list

my_list = [10, 20, 30, 40, 50]

my_series = pd.Series(my_list)

print(my_series)

Output:

0    10
1    20
2    30
3    40
4    50
dtype: int64
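The left-hand column in the output above is the index. By default it is 0, 1, 2, and so on, but custom labels can be supplied. A minimal sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical labels used only for this illustration
scores = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(scores["b"])     # 20, access by label
print(scores.mean())   # 20.0
print((scores * 2).tolist())  # [20, 40, 60], arithmetic is element-wise
```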

A DataFrame is a two-dimensional tabular data structure with labelled axes (rows and columns). It resembles a spreadsheet or a SQL table. DataFrames can be created from many sources, including CSV files, Excel files, SQL databases, and Python dictionaries. The example below creates a DataFrame from a dictionary:

# Creating a DataFrame from a dictionary

my_dict = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
           'age': [25, 32, 18, 47, 29],
           'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}

my_df = pd.DataFrame(my_dict)

print(my_df)

Output:

	name	age	city
0	Alice	25	New York
1	Bob	32	Los Angeles
2	Charlie	18	Chicago
3	David	47	Houston
4	Emily	29	Miami
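Once a DataFrame exists, a few attributes and accessors are useful for inspecting it. The sketch below rebuilds my_df from the example above so that it runs on its own.

```python
import pandas as pd

my_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Emily"],
    "age": [25, 32, 18, 47, 29],
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Miami"],
})

print(my_df.shape)            # (5, 3): 5 rows and 3 columns
print(list(my_df.columns))    # ['name', 'age', 'city']
print(my_df["age"].max())     # 47, a single column is a Series
print(my_df.loc[1, "name"])   # Bob, row label 1 and column 'name'
```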

Data Manipulation

Pandas offers powerful data manipulation tools, including filtering, grouping, and aggregation.

Filtering

Boolean indexing can be used to filter DataFrames. The following example filters a DataFrame to select only the rows where the age is greater than 30:

# Filtering a DataFrame

my_filtered_df = my_df[my_df['age'] > 30]

print(my_filtered_df)

Output:

	name	age	city
1	Bob	32	Los Angeles
3	David	47	Houston
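Several conditions can be combined in one filter. The sketch below, which rebuilds my_df so that it is self-contained, uses & for "and" and wraps each condition in parentheses. The isin() method handles membership tests.

```python
import pandas as pd

my_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Emily"],
    "age": [25, 32, 18, 47, 29],
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Miami"],
})

# Rows where age is between 20 and 40 (exclusive)
in_range = my_df[(my_df["age"] > 20) & (my_df["age"] < 40)]
print(in_range["name"].tolist())   # ['Alice', 'Bob', 'Emily']

# Rows whose city is one of the listed values
selected = my_df[my_df["city"].isin(["Chicago", "Miami"])]
print(selected["name"].tolist())   # ['Charlie', 'Emily']
```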

Grouping

Using the groupby() method, DataFrames can be grouped by one or more columns. The example below demonstrates how to group a DataFrame by city and determine the average age for each city:

# Grouping a DataFrame (numeric_only=True skips the text column 'name')

my_grouped_df = my_df.groupby('city').mean(numeric_only=True)

print(my_grouped_df)

Output:

	age
city	
Chicago	18.0
Houston	47.0
Los Angeles	32.0
Miami	29.0
New York	25.0
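Because every city in my_df appears only once, each group above contains a single row. The sketch below uses a small hypothetical dataset with repeated cities to show groupby() combining several rows per group.

```python
import pandas as pd

# Hypothetical data with two people per city
people = pd.DataFrame({
    "city": ["New York", "Chicago", "New York", "Chicago"],
    "age": [20, 30, 40, 50],
})

# Select the 'age' column, then average it within each city
mean_age = people.groupby("city")["age"].mean()

print(mean_age["New York"])   # 30.0, the mean of 20 and 40
print(mean_age["Chicago"])    # 40.0, the mean of 30 and 50
```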

Aggregation

The agg() method is used to aggregate DataFrames. The example below computes the minimum, maximum, and average age for each city:

# Aggregating a DataFrame

my_agg_df = my_df.groupby('city').agg({'age': ['min', 'max', 'mean']})

print(my_agg_df)

Output:

	age		
	min	max	mean
city			
Chicago	18	18	18.0
Houston	47	47	47.0
Los Angeles	32	32	32.0
Miami	29	29	29.0
New York	25	25	25.0
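The dictionary form of agg() produces two-level column headers. As an alternative, named aggregation gives flat column names. The sketch below rebuilds my_df so that it runs on its own; the result column names min_age, max_age, and mean_age are chosen here for illustration.

```python
import pandas as pd

my_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Emily"],
    "age": [25, 32, 18, 47, 29],
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Miami"],
})

# Each keyword is a new column name mapped to (source column, function)
summary = my_df.groupby("city").agg(
    min_age=("age", "min"),
    max_age=("age", "max"),
    mean_age=("age", "mean"),
)

print(summary.columns.tolist())            # ['min_age', 'max_age', 'mean_age']
print(summary.loc["Houston", "max_age"])   # 47
```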

Matplotlib:

Matplotlib is the most widely used of the Python libraries for data visualization in data science and research. With it you can generate a wide range of plots and charts, such as line charts, scatter plots, histograms, bar charts, and more.

Matplotlib must first be installed, for example with pip (pip install matplotlib), before you can use it. Once it is installed, you can import it into your Python code with the following statement:

import matplotlib.pyplot as plt

This imports the Matplotlib ‘pyplot’ module and gives it the alias ‘plt’, which is frequently used in Matplotlib code.

Let’s begin with a simple Matplotlib line chart. Suppose we have two lists, x and y, which hold the x and y values for the chart. The chart can be created as follows:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]

y = [2, 4, 6, 8, 10]

plt.plot(x, y)

plt.show()

This code first defines the lists x and y, then uses the plot() function to draw a line chart of y against x. The show() function displays the chart, typically in a new window when the script is run outside a notebook.
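Charts are usually easier to read with axis labels and a title. The sketch below draws the same line chart through Matplotlib’s Axes interface and saves it to a file instead of opening a window. The label text and file name are illustrative, and the Agg backend is selected only so that the script runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window is opened
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = 2x")  # line with a marker at each point
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Line chart of y against x")
ax.legend()

# Save the chart to a temporary directory (hypothetical file name)
out_path = os.path.join(tempfile.gettempdir(), "line_chart.png")
fig.savefig(out_path)
```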

Let’s now look at a scatter plot created with Matplotlib. Using the same x and y values as before, the plot can be created as follows:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]

y = [2, 4, 6, 8, 10]

plt.scatter(x, y)

plt.show()

This code generates a scatter plot of ‘y’ against ‘x’ using the ‘scatter()’ function. The resulting plot displays a number of points, each of which stands for a pair of ‘x’ and ‘y’ values.

Let’s now look at a histogram made with Matplotlib. Assume we have a dataset stored in a list, x. The histogram can be created as shown below:

import matplotlib.pyplot as plt

x = [1, 2, 2, 3, 3, 3, 4, 4, 5]

plt.hist(x)

plt.show()

This code creates a histogram of the data in x using the hist() function. The data range is divided into bins (10 equal-width bins by default), and the height of each bar shows how many values fall within that bin.
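The bins can also be set explicitly. In the sketch below, the bin edges are an assumption chosen so that each whole number from 1 to 5 gets its own bin. hist() returns the counts and the bin edges, which makes the result easy to check.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window is opened
import matplotlib.pyplot as plt

x = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Edges at the half-way points give one bin per integer value
counts, edges, _ = plt.hist(x, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5])

print(counts)   # [1. 2. 3. 2. 1.]
```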

Conclusion:

NumPy, Pandas and Matplotlib are three powerful Python libraries for data analysis, providing efficient array manipulation, numerical computing, data manipulation and analysis, and high-quality visualizations to communicate results.
