
Data Science with Python Tutorial

Published On: August 9, 2025

Getting started with a career in data science using Python can be intimidating. You may be wondering where to start, which tools to master, and how to put theory into practice. Beginners often find it challenging to grasp intricate libraries, write optimized code, and visualize data effectively. This tutorial is designed to demystify data science with Python, presenting a clear and complete guide. Ready to revamp your career? Explore our Data Science with Python course syllabus to find out what you’ll learn!

Your Journey into Data Science with Python

In this complete data science with Python tutorial, we’ll go on a journey that will provide you with the basic skills and knowledge required to launch your career in this fast-paced field. We’re going to cover fundamental ideas, must-know libraries, and hands-on examples that will make your grasp of Python for data analysis unshakeable and enable you to tackle real-world data problems.

Why Python for Data Science?

Python has become the go-to language among data scientists across the globe, and for a good reason. Its readability, simplicity, and enormous library ecosystem make it a superpower for doing everything from data analysis and manipulation to machine learning and deep learning. 

Data science with Python also comes with a thriving community and an abundance of learning resources. Because of this broad adoption, Python programming skills for data science are in high demand in the job market.

Setting Up Your Data Science Environment

Before we dive in, let’s prepare your development environment. The most popular and recommended way to install Python for data analysis is Anaconda, a free and open-source distribution of Python and R for scientific computing that includes a package manager, an environment manager, and hundreds of open-source packages.

Installing Anaconda

  • Download Anaconda: Go to the official Anaconda website (https://www.anaconda.com/products/distribution) and download the correct installer for your system (Windows, macOS, or Linux).
  • Run the Installer: Execute the installer and follow the on-screen instructions. It is usually best to keep the default settings.
  • Verify the Installation: After installation, open your terminal or command prompt and run python --version. You should see the version of Python that was installed with Anaconda.
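
You can also confirm from within Python that the core data science libraries bundled with Anaconda are importable. A minimal check (the exact version numbers will vary with your Anaconda release):

import numpy
import pandas

# Print the installed versions; exact numbers depend on your Anaconda release
print(numpy.__version__)
print(pandas.__version__)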

Introducing Jupyter Notebooks

The Jupyter Notebook is an interactive web application where you can create and share documents that combine live code, equations, visualizations, and narrative text. Notebooks are very popular for data science with Python because their interactivity makes it easy to experiment with code, visualize data, and document results.

To start a Jupyter Notebook:

  • Open your command prompt or terminal.
  • Type jupyter notebook and press Enter.
  • The Jupyter Notebook interface should open in a new tab in your web browser.

Suggested: Data Science Online Course.

Python Basics for Data Science

Before we move on to specialized data science libraries, let us review some basic Python concepts necessary for efficient data manipulation.

Variables and Data Types

In Python, variables hold data. Data exists in different forms, which are referred to as data types.

# Integers (whole numbers)
age = 30
print(type(age))  # <class 'int'>

# Floats (numbers with decimal points)
price = 19.99
print(type(price))  # <class 'float'>

# Strings (text)
name = "Alice"
print(type(name))  # <class 'str'>

# Booleans (True or False)
is_student = True
print(type(is_student))  # <class 'bool'>

Lists: Ordered Collections

Lists are ordered, mutable collections of items. In Python, they are essential for storing sequences of data for processing.

fruits = ["apple", "banana", "cherry"]
print(fruits[0])  # Output: apple (indexing starts from 0)

fruits.append("orange")
print(fruits)  # Output: ['apple', 'banana', 'cherry', 'orange']

Dictionaries: Key-Value Pairs

Dictionaries are mutable collections of key-value pairs (in modern Python they preserve insertion order). They are helpful for organizing and storing structured data.

person = {"name": "Bob", "age": 25, "city": "New York"}
print(person["name"])  # Output: Bob

person["occupation"] = "Engineer"
print(person)  # Output: {'name': 'Bob', 'age': 25, 'city': 'New York', 'occupation': 'Engineer'}

Control Flow: Conditionals and Loops

Control flow statements allow you to control the order in which your code is executed.

If-Else Statement:

score = 85
if score >= 90:
    print("Excellent!")
elif score >= 70:
    print("Good job!")
else:
    print("Keep practicing.")

For Loops:

for fruit in fruits:
    print(fruit)

While Loops:

count = 0
while count < 5:
    print(count)
    count += 1

Functions: Reusable Blocks of Code

Functions encourage modularity in your Python data analytics code by letting you encapsulate a piece of logic that can be reused repeatedly.

def greet(name):
    return f"Hello, {name}!"

message = greet("Data Scientist")
print(message)  # Output: Hello, Data Scientist!

Recommended: Core Python Course Online.

Indispensable Libraries for Data Science with Python

Let’s now delve into the powerhouse libraries that make Python so effective for data science.

NumPy: The Numerical Python

NumPy is the foundation of numerical computing in Python. It provides support for large multi-dimensional arrays and matrices, along with a wide variety of high-level mathematical functions to operate on them. NumPy is the base upon which many other Python data science libraries are built.

Important Features of NumPy:
  • Efficient Arrays: NumPy arrays are much faster than Python lists for numerical computation.
  • Vectorization: Compute on entire arrays without using explicit loops, resulting in higher-performance execution.
  • Mathematical Functions: An extensive list of functions for linear algebra, Fourier transform, and random number generation.

Example:

import numpy as np

# Creating a NumPy array
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))  # <class 'numpy.ndarray'>

# Performing element-wise operations
arr_plus_one = arr + 1
print(arr_plus_one)

# Multiplying arrays
arr2 = np.array([10, 20, 30, 40, 50])
product = arr * arr2
print(product)

# Multi-dimensional arrays (matrices)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix)
print(matrix.shape)  # Output: (2, 3) (2 rows, 3 columns)
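
To see the vectorization benefit from the feature list above in action, here is a small timing sketch. The exact speedup is machine-dependent; the printed numbers are only illustrative:

import time
import numpy as np

data = list(range(1_000_000))
arr = np.array(data)

# Pure-Python loop: square each element one at a time
start = time.perf_counter()
squares_loop = [x * x for x in data]
loop_time = time.perf_counter() - start

# Vectorized NumPy: one expression operates on the whole array at once
start = time.perf_counter()
squares_vec = arr * arr
vec_time = time.perf_counter() - start

print(f"Loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")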

Pandas: Data Analysis and Manipulation

Pandas is perhaps the most valuable data analysis library for Python. It offers powerful, flexible, and easy-to-use data structures, most notably the DataFrame, which is well suited to tabular data. Mastering Pandas is a requirement if you’re serious about data analysis with Python.

Major Features of Pandas:
  • DataFrame: It is a two-dimensional, size-mutable, tabular data structure with labeled axes (rows and columns).
  • Series: A one-dimensional array with labels that can store any data type.
  • Data Cleaning and Preparation: Dealing with missing values, reshaping data, combining datasets, etc.
  • Data Exploration: Computing descriptive statistics, grouping data, and aggregation.
  • Input/Output Tools: Reading and writing data in different formats (CSV, Excel, SQL databases, etc.).

Let’s look at some typical Pandas operations:

import pandas as pd

# Creating a Series
s = pd.Series([10, 20, 30, 40, 50])
print(s)

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Paris', 'New York']
}
df = pd.DataFrame(data)
print(df)

# Accessing columns
print(df['Name'])
print(df.Age)  # Another way to access columns

# Accessing rows (using .loc for label-based indexing)
print(df.loc[0])  # Access the first row
print(df.loc[df['City'] == 'New York'])  # Filter rows based on a condition

# Adding a new column
df['Salary'] = [70000, 80000, 90000, 75000]
print(df)

# Basic descriptive statistics
print(df.describe())

# Grouping data
city_age = df.groupby('City')['Age'].mean()
print(city_age)

# Reading data from a CSV file (assuming 'sales_data.csv' exists)
# sales_df = pd.read_csv('sales_data.csv')
# print(sales_df.head())  # Display the first 5 rows
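
Since sales_data.csv may not exist on your machine, here is a self-contained round trip instead, writing the DataFrame built above to disk and reading it back (people.csv is just an example file name):

# Write the DataFrame to a CSV file, then read it back
df.to_csv('people.csv', index=False)  # index=False omits the row index column
df_from_csv = pd.read_csv('people.csv')
print(df_from_csv.head())  # head() shows the first 5 rows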

Matplotlib and Seaborn: Data Visualization

Data visualization is an essential stage in data science with Python. It helps us understand the patterns, trends, and outliers in our data.

Matplotlib is Python’s base plotting library, and Seaborn is a higher-level library built on top of Matplotlib that provides a convenient interface for producing attractive, informative statistical graphics. Both libraries are essential tools for any Python data scientist.

Matplotlib Basics:

import matplotlib.pyplot as plt
import numpy as np

# Simple line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.title("Sine Wave")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

# Scatter plot
x_scatter = np.random.rand(50) * 10
y_scatter = np.random.rand(50) * 10
plt.scatter(x_scatter, y_scatter)
plt.title("Random Scatter Plot")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

# Histogram
data_hist = np.random.randn(1000)
plt.hist(data_hist, bins=30, edgecolor='black')
plt.title("Histogram of Random Data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

Learn more with our Python full stack course in Chennai.

Seaborn for Enhanced Visualizations:

Particularly for statistical data, Seaborn facilitates the creation of more intricate and visually appealing graphs.

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame for visualization
data_viz = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Value': [10, 15, 12, 18, 13, 11, 20, 16],
    'Group': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y']
}
df_viz = pd.DataFrame(data_viz)

# Bar plot using Seaborn
sns.barplot(x='Category', y='Value', data=df_viz)
plt.title("Value by Category")
plt.show()

# Box plot (useful for visualizing distribution and outliers)
sns.boxplot(x='Category', y='Value', data=df_viz)
plt.title("Box Plot of Value by Category")
plt.show()

# Scatter plot with hue for an additional categorical variable
# (reusing the earlier df, which has Age, City, and Salary columns)
sns.scatterplot(x='Age', y='Salary', hue='City', data=df)
plt.title("Salary vs Age by City")
plt.show()

Explore: Python Tutorial for Beginners.

The Python Data Science Workflow

A Python data science project typically follows a structured workflow. Familiarizing yourself with these steps is essential for any aspiring data scientist.

  1. Data Acquisition and Collection

The initial step is collecting the data you require for your analysis. The data might come from any of the following sources:

  • Databases: SQL databases (PostgreSQL, MySQL, SQLite), NoSQL databases (MongoDB, Cassandra).
  • APIs: Web services with programmatic access to data (e.g., Twitter API, weather APIs).
  • Web Scraping: Retrieving data directly from websites (with libraries such as Beautiful Soup or Scrapy).
  • Files: CSV, Excel, JSON, Parquet, HDF5.

In this tutorial, we will work mostly with data from files, which is the most common starting point for data analysis in Python.
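
For the database source mentioned above, here is a minimal sketch using Python’s built-in sqlite3 module; the table and values are purely illustrative:

import sqlite3
import pandas as pd

# Build a small in-memory SQLite database (illustrative data)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales (product TEXT, amount REAL)')
conn.executemany('INSERT INTO sales VALUES (?, ?)',
                 [('Widget', 19.99), ('Gadget', 24.50), ('Widget', 15.00)])
conn.commit()

# pd.read_sql_query pulls the result of any SQL query into a DataFrame
sales_df = pd.read_sql_query('SELECT * FROM sales', conn)
print(sales_df)
conn.close()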

  2. Data Cleaning and Preprocessing

Real-world data is rarely clean and ready for analysis. Cleaning it is usually the most time-consuming part of data science work in Python. Some of the most important tasks are:

  • Handling Missing Values:
    • Imputation: Replacing missing values with an imputed value (mean, median, mode).
    • Deletion: Deleting rows or columns containing missing values (use sparingly).
  • Handling Outliers: Discovering and managing extreme values that can distort analysis.
  • Data Type Conversion: Converting columns to correct data types (e.g., from string ‘123’ to integer 123).
  • Removing Duplicates: Finding and eliminating duplicate rows.
  • Feature Engineering: Deriving new features from existing ones for better model performance.
  • Data Transformation: Scaling or normalizing data (crucial for most machine learning algorithms).

Let’s demonstrate some of these with Pandas:

import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values and duplicates
data_raw = {
    'ID': [1, 2, 3, 4, 5, 1, 6],
    'Feature1': [10, 20, np.nan, 40, 50, 10, 60],
    'Feature2': ['A', 'B', 'C', 'A', np.nan, 'A', 'D'],
    'Value': [100, 150, 120, 100, 180, 100, 200]
}
df_raw = pd.DataFrame(data_raw)
print("Original DataFrame:")
print(df_raw)

# Check for missing values
print("\nMissing values:")
print(df_raw.isnull().sum())

# Fill missing numerical values with the mean
df_cleaned = df_raw.copy()
df_cleaned['Feature1'] = df_cleaned['Feature1'].fillna(df_cleaned['Feature1'].mean())

# Fill missing categorical values with the mode
df_cleaned['Feature2'] = df_cleaned['Feature2'].fillna(df_cleaned['Feature2'].mode()[0])
print("\nDataFrame after filling missing values:")
print(df_cleaned)

# Remove duplicate rows (based on all columns by default, or specific columns)
df_no_duplicates = df_cleaned.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)

# Convert data type
df_no_duplicates['Value'] = df_no_duplicates['Value'].astype(float)
print("\nDataFrame after converting 'Value' to float:")
print(df_no_duplicates.dtypes)
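
The task list above also mentions outlier handling and data transformation, which the demo does not cover. Here is one brief, common approach to each (the IQR rule and z-score scaling), continuing from the cleaned DataFrame; treat it as a sketch rather than the only method:

from sklearn.preprocessing import StandardScaler

# Flag outliers in 'Value' using the interquartile-range (IQR) rule
q1 = df_no_duplicates['Value'].quantile(0.25)
q3 = df_no_duplicates['Value'].quantile(0.75)
iqr = q3 - q1
outliers = (df_no_duplicates['Value'] < q1 - 1.5 * iqr) | \
           (df_no_duplicates['Value'] > q3 + 1.5 * iqr)
print(df_no_duplicates[outliers])  # rows flagged as outliers (may be empty)

# Standardize 'Feature1' to zero mean and unit variance (z-score scaling)
scaler = StandardScaler()
df_no_duplicates[['Feature1']] = scaler.fit_transform(df_no_duplicates[['Feature1']])
print(df_no_duplicates['Feature1'])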

Suggested Guide: Data Analytics Tutorial for Beginners.

  3. Exploratory Data Analysis (EDA)

EDA is an essential step in which you use visualization and statistical techniques to better understand your data. It allows you to discover patterns, detect anomalies, test hypotheses, and verify assumptions. Python for data analysis really comes into its own here.

  • Descriptive Statistics: Compute mean, median, mode, standard deviation, quartiles, etc. (df.describe()).
  • Data Distribution: Plot the distribution of individual variables (histograms, box plots).
  • Relationships Between Variables: Investigate correlations between numerical variables (scatter plots, correlation matrices).
  • Categorical Variable Analysis: Examine the distribution of categorical variables and how they correlate with numerical variables (bar plots, count plots).

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load a sample dataset (e.g., from Seaborn's built-in datasets)
# Let's use the 'tips' dataset, a classic for EDA examples
tips = sns.load_dataset('tips')
print("Tips Dataset Head:")
print(tips.head())

print("\nDescriptive Statistics for Numerical Columns:")
print(tips.describe())

# Distribution of 'total_bill'
sns.histplot(tips['total_bill'], kde=True)
plt.title('Distribution of Total Bill')
plt.xlabel('Total Bill ($)')
plt.ylabel('Frequency')
plt.show()

# Relationship between 'total_bill' and 'tip'
sns.scatterplot(x='total_bill', y='tip', data=tips, hue='time', style='smoker')
plt.title('Total Bill vs. Tip')
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
plt.show()

# Box plot of 'tip' by 'day'
sns.boxplot(x='day', y='tip', data=tips)
plt.title('Tip Amount by Day of the Week')
plt.xlabel('Day')
plt.ylabel('Tip ($)')
plt.show()

# Correlation matrix (for numerical columns)
correlation_matrix = tips[['total_bill', 'tip', 'size']].corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()

  4. Machine Learning (Modeling)

Once your data is clean and well understood, you can move on to building machine learning models. Scikit-learn is the most widely used machine learning library for Python, offering an extensive variety of algorithms for:

Supervised Learning:
  • Regression: Predicting a continuous output (e.g., house prices prediction).
    • Linear Regression
    • Decision Tree Regressor
    • Random Forest Regressor
  • Classification: Predicting a categorical output (e.g., determining whether an email is spam).
    • Logistic Regression
    • Decision Tree Classifier
    • Support Vector Machines (SVM)
    • K-Nearest Neighbors (KNN)
Unsupervised Learning:
  • Clustering: Clustering similar data points (e.g., customer segmentation).
    • K-Means Clustering
    • DBSCAN
  • Dimensionality Reduction: Reducing the number of features while retaining the key information (e.g., Principal Component Analysis – PCA).
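
Clustering can be illustrated just as briefly. Here is a minimal K-Means sketch on synthetic two-blob data (the points and cluster count are made up for illustration):

from sklearn.cluster import KMeans
import numpy as np

# Two well-separated blobs of 2-D points (synthetic data)
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(0, 0.5, (20, 2)),
                    rng.normal(5, 0.5, (20, 2))])

# Ask K-Means for 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(points)
print(labels)                    # cluster index (0 or 1) for each point
print(kmeans.cluster_centers_)   # coordinates of the two cluster centers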

Let’s do a simple linear regression example using Python programming for data science.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

# Create some sample data for linear regression
# Let's say we want to predict 'y' based on 'X'
np.random.seed(0)
X = 2 * np.random.rand(100, 1)  # 100 samples, 1 feature
y = 4 + 3 * X + np.random.randn(100, 1) * 2  # y = 4 + 3X + noise

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"\nMean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

# Plot the regression line
plt.scatter(X_test, y_test, color='black', label='Actual data')
plt.plot(X_test, y_pred, color='blue', linewidth=3, label='Regression line')
plt.title('Linear Regression Example')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()
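
The regression walkthrough above has a natural classification counterpart. As a brief sketch, here is logistic regression on scikit-learn’s built-in iris dataset; any of the classifiers listed earlier could be swapped in:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the classic iris dataset (150 flowers, 4 features, 3 species)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")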

Learn more with our machine learning course online.

  5. Model Evaluation and Deployment

Once a model is constructed, it’s essential to analyze its performance using the right metrics (e.g., precision, recall, F1-score for classification; RMSE, R-squared for regression). 

Once the model has been verified to be effective, it can then be deployed in real-world applications, usually by incorporating it into a web application or existing system. This is where the real-world impact is created in Python and data science.
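
To make these metrics concrete, here is a short sketch building on the logistic-regression example above: classification_report prints per-class precision, recall, and F1-score, and joblib (installed alongside scikit-learn) is one common way to persist a trained model. The file name is just an example:

from sklearn.metrics import classification_report
import joblib

# Per-class precision, recall, and F1-score in one report
print(classification_report(y_test, y_pred))

# Persist the trained model to disk, then reload it for reuse
joblib.dump(clf, 'iris_model.joblib')  # 'iris_model.joblib' is an example name
loaded_clf = joblib.load('iris_model.joblib')
print(loaded_clf.predict(X_test[:5]))  # predictions from the reloaded model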

Tips for Data Science Beginners with Python

  • Practice Regularly: The best way to learn data science with Python is by practicing. Work on small projects, enter Kaggle competitions, and apply what you have learned.
  • Learn the Math: Although Python does the heavy lifting, a basic understanding of statistics, linear algebra, and calculus will be a big plus.
  • Master Pandas: Take time to become an expert with Pandas. It’s the workhorse of Python data analysis.
  • Learn to Ask Questions: Feel free to use online forums such as Stack Overflow and the official documentation. The Python data science community is large and helpful.
  • Version Control (Git/GitHub): Learn Git for version control. It is needed for collaborating on projects and tracking your code.
  • Explore Different Datasets: Work with diverse datasets to gain experience with different data types and challenges.
  • Stay Updated: The field of data science with Python is constantly evolving. Keep learning about new libraries, techniques, and best practices.

Explore: All Trending Courses to Begin Your IT Career.

Conclusion

Congratulations on taking your first steps into the exciting world of data science with Python! You’ve learned how to set up your environment, covered core Python concepts, and met the essential libraries: NumPy, Pandas, Matplotlib, and Seaborn. We’ve also taken a brief tour of the fundamental workflow of a data science project, from cleaning data to simple modeling.

Remember, expertise comes with practice. Keep learning, keep programming, and keep building! Ready to hone your skills further and become an expert data scientist? Discover our in-depth Data Science with Python course to gain expert-level techniques and hands-on project experience.
