Machine Learning Basics with scikit-learn

Introduction

Machine learning is the science of programming computers to learn from data. It is a branch of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.

In this tutorial, you will be introduced to machine learning basics using one of the most popular Python libraries for machine learning, scikit-learn. Scikit-learn is a free, open-source machine learning library built on NumPy, SciPy, and matplotlib. It provides efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, all through a consistent interface.

Installing scikit-learn

Before you can use scikit-learn, you need to install it. Run the following pip command (the -U flag upgrades the package if it is already installed):

pip install -U scikit-learn
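
To confirm that the installation worked, one option is to import the package and print its version. This is a minimal check, not part of the original tutorial; the variable name version is only illustrative.

```python
# Quick sanity check: the import fails if scikit-learn is not installed.
import sklearn

version = sklearn.__version__  # version string of the installed package
print("scikit-learn version:", version)
```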

Loading an Example Dataset

Scikit-learn comes with a few standard datasets, for instance, the iris and digits datasets for classification and the diabetes dataset for regression. (Older tutorials also mention the Boston house prices dataset, but it was removed in scikit-learn 1.2.)

Let's load the iris dataset:

from sklearn import datasets
iris = datasets.load_iris()

In the example above, the iris dataset is loaded from the datasets module. The object returned by load_iris() is a Bunch, which behaves much like a dictionary but also lets you access its entries as attributes (for example, iris.data). You can list its keys:

print(iris.keys())
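
Beyond the keys, it can help to look at a few of the entries themselves. The sketch below prints the feature names, the class names, and the shape of the data array; it only uses attributes that the Bunch returned by load_iris() provides.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Names of the four measured attributes (columns of iris.data)
print(iris.feature_names)

# Names of the three species (the classes encoded in iris.target)
print(iris.target_names)

# Number of samples and number of features: (150, 4)
print(iris.data.shape)
```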

Splitting the Data

The next step is to split our dataset into its attributes and labels. To do this, use the following code:

X = iris.data
y = iris.target

In the code above, iris.data contains the attributes (sepal length, sepal width, petal length, and petal width) and iris.target contains the labels (the species of each flower, encoded as the integers 0, 1, and 2).
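
A quick way to check what X and y hold is to look at their shapes and at how many samples belong to each class. This is a small illustrative sketch; the name class_counts is not from the original tutorial.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data    # one row per flower, one column per attribute
y = iris.target  # one integer label per flower

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)

# Count how many samples carry each label (0, 1, 2)
class_counts = np.bincount(y)
print(class_counts)  # [50 50 50]
```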

Training the Model

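Once the data is split into attributes and labels, the next step is to train our machine learning model. We first set aside 20% of the samples as a test set with train_test_split, so that the model can later be evaluated on data it has not seen, and then fit the model on the remaining 80%. In this case, we will use the K-nearest neighbors (KNN) algorithm.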

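from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out 20% of the samples for testing; the split is random on each run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Classify each sample by a majority vote among its 3 nearest training samples
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)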
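
Because train_test_split shuffles the data randomly, the split (and therefore the results) can change from run to run. The sketch below shows one way to make it repeatable. The random_state=42 and stratify=y arguments are additions not used in the tutorial's code: random_state fixes the shuffle, and stratify keeps the class proportions the same in both subsets.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Assumed settings for illustration: fixed seed and stratified sampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # [10 10 10]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
```

With 150 samples and test_size=0.2, the test set holds 30 samples; stratifying gives 10 per species.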

Making Predictions

Once the model is trained, we can make predictions on the held-out test samples. Every scikit-learn classifier provides a predict() method for this purpose; it returns one predicted label per input row.

predictions = knn.predict(X_test)
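
The predicted labels are integers, so it is often more readable to map them back to species names. The sketch below does that and also classifies a single new flower. The measurements in new_flower are made up for illustration, and random_state=42 and stratify=y are assumptions added only to keep the example repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

# Translate integer labels into species names via NumPy indexing
predicted_names = iris.target_names[predictions]
print(predicted_names[:5])

# predict() expects a 2D array: one row per sample, one column per feature
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])  # hypothetical measurements in cm
new_species = iris.target_names[knn.predict(new_flower)[0]]
print(new_species)
```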

Evaluating the Model

The final step is to evaluate how well our model performs on the test set. Scikit-learn's metrics module has a classification_report function, which summarizes the most common classification metrics in one table:

from sklearn import metrics
print(metrics.classification_report(y_test, predictions))

In the code above, classification_report is a function that compares the true labels with the predicted ones and returns a text summary of the precision, recall, F1-score, and support (the number of true samples) for each class.
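
Two other functions from the same module give a complementary view: accuracy_score returns the fraction of correct predictions, and confusion_matrix shows which classes get mistaken for which. The sketch below is self-contained; random_state=42 and stratify=y are assumptions added for repeatability and are not part of the tutorial's code.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

# Fraction of test samples that were classified correctly
accuracy = metrics.accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

# Rows are true classes, columns are predicted classes;
# correct predictions lie on the diagonal
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)
```

The accuracy equals the sum of the diagonal of the confusion matrix divided by the total number of test samples.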

Conclusion

That's it! You've just trained your first machine learning model using the scikit-learn library in Python. You've learned how to load a dataset, split it into attributes and labels, train a model, make predictions, and evaluate its performance. But this is just the tip of the iceberg: scikit-learn has many more features, which you will explore as you dive deeper into machine learning. Happy learning!