Machine learning has become a cornerstone of modern data science, enabling systems to learn from data and make predictions. This post introduces several fundamental machine learning algorithms, outlines their mathematical underpinnings, and provides sample code to illustrate their implementation.
Linear Regression
Linear regression is one of the simplest and most commonly used algorithms. It models the relationship between a dependent variable \(y\) and one or more independent variables \(x\) by fitting a linear equation:
\[y = \beta_0 + \beta_1 x + \epsilon\]
where:
- \(\beta_0\) is the y-intercept,
- \(\beta_1\) is the slope of the line,
- \(\epsilon\) represents the error term.
Here’s a Python implementation using scikit-learn:
from sklearn.linear_model import LinearRegression
import numpy as np
# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 3, 5, 7, 11])
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# Make predictions
predictions = model.predict(X)
print(predictions)
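The same coefficients can also be recovered directly from the least-squares solution \(\hat{\beta} = (X^\top X)^{-1} X^\top y\), which is what the library computes under the hood. Here is a minimal NumPy sketch of that computation (the variable names are my own, not part of scikit-learn):

```python
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 3, 5, 7, 11])

# Prepend a column of ones so the intercept beta_0 is estimated too
X_design = np.hstack([np.ones((len(X), 1)), X])

# Solve the least-squares problem min ||X_design @ beta - y||^2
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Intercept:", beta[0], "Slope:", beta[1])
```

The printed intercept and slope should agree with model.intercept_ and model.coef_ from the scikit-learn fit above.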
Logistic Regression
Logistic regression is used for binary classification problems. It estimates probabilities using the logistic function:
\[p = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x))}\]
where:
- \(p\) is the probability of the dependent variable being 1,
- \(\exp\) denotes the exponential function.
Python implementation with scikit-learn:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
# Load dataset
data = load_iris()
X = data.data
y = (data.target != 0).astype(int)  # binary target: setosa (0) vs. the other two classes (1)
# Create and fit the model
model = LogisticRegression()
model.fit(X, y)
# Make predictions
predictions = model.predict(X)
print(predictions)
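To tie the fitted model back to the logistic function above, you can recompute the probabilities by hand from the learned coefficients and compare them with predict_proba. A minimal sketch (repeating the fit so the snippet stands alone):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

data = load_iris()
X = data.data
y = (data.target != 0).astype(int)

model = LogisticRegression().fit(X, y)

# Linear term beta_0 + beta^T x for every sample
z = X @ model.coef_.ravel() + model.intercept_[0]

# Logistic function from the formula above
p_manual = 1 / (1 + np.exp(-z))

# Should match scikit-learn's probability for class 1
print(np.allclose(p_manual, model.predict_proba(X)[:, 1]))
```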
Decision Trees
Decision trees split the data into subsets based on feature values, creating a tree-like model. The decision rule at each node is based on a criterion like Gini impurity or entropy.
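For a node whose samples fall into classes with proportions \(p_k\), the Gini impurity is
\[G = 1 - \sum_k p_k^2\]
and the entropy is
\[H = -\sum_k p_k \log_2 p_k.\]
At each node, the algorithm chooses the split that reduces the chosen impurity measure the most.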
Python code for a decision tree classifier:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
# Load dataset
data = load_iris()
X = data.data
y = data.target
# Create and fit the model
model = DecisionTreeClassifier()
model.fit(X, y)
# Make predictions
predictions = model.predict(X)
print(predictions)
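Because the fitted model is literally a tree of threshold rules, it can be printed in readable form with sklearn.tree.export_text. A short sketch (the max_depth=2 cap is my own choice to keep the printout small):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# Shallow tree so the printed rules stay readable
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(data.data, data.target)

# One line per node: "feature <= threshold" rules and leaf classes
print(export_text(model, feature_names=list(data.feature_names)))
```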
K-Means Clustering
K-means clustering partitions data into K clusters by minimizing the sum of squared distances between each point and its cluster center (the within-cluster variance). The algorithm alternates between two steps: assigning each data point to the nearest cluster center, and updating each center to the mean of the points assigned to it.
Python implementation:
from sklearn.cluster import KMeans
import numpy as np
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# Create and fit the model; n_init=10 repeats the run with 10 different
# initializations and keeps the best, avoiding a version-dependent default
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Print cluster centers and labels
print("Cluster centers:\n", kmeans.cluster_centers_)
print("Labels:\n", kmeans.labels_)
Conclusion
Machine learning algorithms provide powerful tools for analyzing data and making predictions. By understanding and implementing these algorithms, we can harness their potential to solve a wide range of problems. The Python snippets above offer a starting point for exploring these concepts; note that for brevity they fit and predict on the same data, so real applications should evaluate performance on a held-out test set.