Machine Learning Basics with scikit-learn
Introduction
Machine learning is the science of programming computers to learn from data. It is a branch of artificial intelligence (AI) that lets software applications become more accurate at predicting outcomes without being explicitly programmed with rules for the task.
In this tutorial, you will be introduced to machine learning basics using one of the most popular Python libraries for machine learning, scikit-learn. Scikit-learn is a free software machine learning library that is built on NumPy, SciPy, and matplotlib. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistent interface.
Installing Scikit-Learn
To start using Scikit-learn, you need to install it first. You can do it by running the following pip command:
pip install -U scikit-learn
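To confirm that the installation worked, a quick check along these lines may help. It is only a sketch, and the variable name installed_version is chosen here for illustration. Note that the package is imported as sklearn, not scikit-learn.

```python
# Quick sanity check: import the package and report the installed version.
import sklearn

installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```

If the import fails, pip most likely installed the package into a different Python environment from the one you are running.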
Loading an Example Dataset
Scikit-learn comes with a few small standard datasets, for instance, the iris and digits datasets for classification and the diabetes dataset for regression. (The Boston house prices dataset that older tutorials mention was removed in scikit-learn 1.2.)
Let's load the iris dataset:
from sklearn import datasets
iris = datasets.load_iris()
In the example above, the iris dataset is loaded from the datasets module. The iris object returned by load_iris() is a Bunch object, which is very similar to a dictionary. It contains keys and values:
print(iris.keys())
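Because a Bunch behaves like a dictionary, its entries can be read either by key or as attributes. The short sketch below shows both styles on the iris data. The comments describe what each entry holds for this particular dataset.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Bunch entries can be read as dictionary keys or as attributes.
print(iris["data"].shape)   # (150, 4): 150 flowers, 4 measurements each
print(iris.feature_names)   # names of the four measurement columns
print(iris.target_names)    # names of the three species
```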
Splitting the Data
The next step is to split our dataset into its attributes and labels. To do this, use the following code:
X = iris.data
y = iris.target
In the code above, iris.data contains the attributes (sepal length, sepal width, petal length, and petal width) and iris.target contains the labels (the species of each flower, encoded as integers).
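It can be useful to look at the shapes of the two arrays before going further. The sketch below assumes NumPy is available (it is installed as a dependency of scikit-learn) and uses np.bincount to count how many flowers carry each label. The name class_counts is just an illustrative choice.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

print(X.shape)      # (150, 4): one row per flower, one column per attribute
print(y.shape)      # (150,): one integer label per flower
print(X[0], y[0])   # the first flower's measurements and its label

# Count how many samples belong to each of the three classes.
class_counts = np.bincount(y)
print(class_counts)  # [50 50 50]
```

The classes are perfectly balanced, which makes plain accuracy a reasonable first metric for this dataset.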
Training the Model
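Once the data is split into attributes and labels, the next step is to train our machine learning model. We first hold out 20% of the samples as a test set with train_test_split, so that the model can later be evaluated on data it has not seen, and then fit the model on the remaining 80%. In this case, we will use the K-nearest neighbors (KNN) algorithm, which classifies a sample by a majority vote among its closest training samples.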
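from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Hold out 20% of the samples for testing; the split is random on each run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Classify each sample by a vote among its 3 nearest training samples.
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the model on the training portion only.
knn.fit(X_train, y_train)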
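The split above is random, so the results can differ from run to run. One common way to make it repeatable is sketched below. The seed value 42 and the use of stratify=y are assumptions added for this example and are not part of the original code. Stratifying keeps the three species in the same proportions in both the training and test sets.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# random_state=42 is an arbitrary seed chosen for this sketch;
# stratify=y preserves the class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
```

With 150 samples and test_size=0.2, 30 samples are held out and 120 are used for training.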
Making Predictions
Once the model is trained, we can make predictions. Scikit-learn classifiers provide a predict() method for this purpose. It takes an array of samples and returns one predicted label per sample.
predictions = knn.predict(X_test)
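The same method works on measurements that are not part of the dataset. The sketch below classifies a single hypothetical flower. The measurement values are made up for illustration, and the model is fitted on all 150 samples purely to keep the example short. predict_proba is shown as well; for KNN with default settings it reports the fraction of neighbors that voted for each class.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(iris.data, iris.target)  # fitted on all rows for brevity in this sketch

# Hypothetical flower: sepal length, sepal width, petal length, petal width (cm).
# predict() expects a 2D array, one row per sample.
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

predicted_label = knn.predict(new_flower)                # integer class label(s)
predicted_name = iris.target_names[predicted_label[0]]   # map label to species name
probabilities = knn.predict_proba(new_flower)            # one row of class shares

print(predicted_label, predicted_name)
print(probabilities)
```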
Evaluating the Model
The final step is to evaluate how well our model is doing. Scikit-learn's metrics module has a classification_report function, which can be very handy:
from sklearn import metrics
print(metrics.classification_report(y_test, predictions))
In the code above, classification_report is a function that returns a text summary of the classification, listing the precision, recall, F1-score and support (the number of test samples) for each class.
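For a more compact view, the metrics module also offers accuracy_score and confusion_matrix. The sketch below repeats the earlier steps so that it runs on its own. The seed random_state=0 is an arbitrary choice made for this example, so the exact numbers depend on it.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0  # arbitrary seed
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
predictions = knn.predict(X_test)

# Fraction of test samples whose predicted label matches the true label.
accuracy = metrics.accuracy_score(y_test, predictions)

# Rows are true classes, columns are predicted classes;
# labels=[0, 1, 2] fixes the matrix to 3x3 regardless of the split.
cm = metrics.confusion_matrix(y_test, predictions, labels=[0, 1, 2])

print(f"accuracy: {accuracy:.2f}")
print(cm)
```

The diagonal of the confusion matrix counts correct predictions, so any off-diagonal entry shows which species the model confused with which.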
Conclusion
That's it! You've just trained your first machine learning model using the scikit-learn library in Python. You've learned how to load a dataset, split it into attributes and labels, hold out a test set, train a model, make predictions and evaluate its performance. But this is just the tip of the iceberg: scikit-learn has many more features, which you will encounter as you dive deeper into machine learning. Happy learning!