Building a Sequence Classifier with Machine Learning and a Slick GUI

Ashar Ahmed
3 min read · Nov 4, 2023


I recently worked on an interesting machine learning project — developing a model to classify DNA, RNA and protein sequences. In this post, I’ll walk through how I built the model using Python and created an intuitive graphical user interface (GUI) to bring it to life.

Code

https://github.com/asharahmed1/sequence_classifier

Reading and Prepping the Data

The first step was getting the raw sequence data into a usable format. The files were in FASTA format: each record starts with a header line beginning with a “>” character, followed by one or more lines of sequence. I wrote a function to parse these files and extract each sequence into a list.
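Here is a minimal sketch of what such a parser could look like. It is not the code from the repo; the function name `parse_fasta` and the sample records are made up for illustration, and it assumes plain FASTA text where sequence data may wrap across several lines.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from the lines of a FASTA file."""
    records = []
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data may wrap across lines; collect the pieces.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical input, standing in for the lines of a real file.
example = [">seq1 demo", "ACGT", "acgt", "", ">seq2", "MKV"]
records = parse_fasta(example)
sequences = [seq for _, seq in records]
```

Because the function only needs an iterable of lines, an open file handle can be passed straight in, for example `with open(path) as fh: records = parse_fasta(fh)`.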

Next, I needed to convert the sequences into numeric features that could be fed into a machine learning algorithm. For this, I used one-hot encoding, which converts each nucleotide or amino acid letter into a binary vector. For a four-letter nucleotide alphabet, ‘A’ becomes [1, 0, 0, 0], ‘C’ becomes [0, 1, 0, 0] and so on; amino acids need a longer vector, with one slot per letter in the alphabet.

My encoding function turns each sequence into a fixed-length feature vector, padding shorter sequences and truncating longer ones. This prepared the data for training.
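To make the padding and truncation concrete, here is a rough sketch of a fixed-length one-hot encoder. The name `one_hot_encode`, the `length` parameter and the choice to zero-pad are assumptions for illustration, not necessarily what the repo does.

```python
from typing import List

# Assumed alphabet; a protein encoder would pass the 20 amino acid letters.
NUCLEOTIDES = "ACGT"


def one_hot_encode(seq: str, length: int, alphabet: str = NUCLEOTIDES) -> List[int]:
    """Encode seq as a flat one-hot vector with length * len(alphabet) entries."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    vector = [0] * (length * len(alphabet))
    # Slicing truncates long sequences. Positions past the end of a short
    # sequence stay all-zero, which acts as the padding.
    for pos, letter in enumerate(seq.upper()[:length]):
        i = index.get(letter)
        if i is not None:  # unknown letters (e.g. N) are left as all zeros
            vector[pos * len(alphabet) + i] = 1
    return vector


short = one_hot_encode("AC", 3)     # padded to 3 positions
long_ = one_hot_encode("ACGTA", 2)  # truncated to 2 positions
```

Every sequence ends up with the same number of features, which is what most classifiers expect from their input.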

Choosing and Training a Model

With the sequence data numerically encoded, I could start training models. I experimented with a few different classifiers — SVMs, random forests and neural networks. The SVM (support vector machine) performed best, so I went with that.

I trained the SVM on the feature vectors from positive and negative example sequences. The model learned how to distinguish between the two classes based on the encoded sequence patterns.
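As a rough illustration of that training step, the sketch below fits a scikit-learn `SVC` on toy data. Everything about the data is invented: the "positive" sequences are simply GC-rich and the "negative" ones AT-rich, and the linear kernel and 75/25 split are assumptions rather than the settings I actually used.

```python
import random

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

ALPHABET = "ACGT"


def encode(seq, length=20):
    """Flat one-hot encoding, zero-padded or truncated to `length` positions."""
    vec = [0] * (length * len(ALPHABET))
    for pos, letter in enumerate(seq[:length]):
        vec[pos * len(ALPHABET) + ALPHABET.index(letter)] = 1
    return vec


def make_sequence(rng, preferred, other, length=20):
    """Toy generator: each letter comes from `preferred` about 90% of the time."""
    return "".join(
        rng.choice(preferred if rng.random() < 0.9 else other)
        for _ in range(length)
    )


rng = random.Random(0)
positives = [make_sequence(rng, "GC", "AT") for _ in range(40)]
negatives = [make_sequence(rng, "AT", "GC") for _ in range(40)]
X = [encode(s) for s in positives + negatives]
y = [1] * 40 + [0] * 40

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# probability=True enables predict_proba, one way to report a confidence.
model = SVC(kernel="linear", probability=True, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)          # held-out accuracy
probabilities = model.predict_proba(X_test[:1])  # class probabilities for one sample
```

Holding out a test split keeps the accuracy estimate honest, since the model never sees those sequences during training.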

Building an Interactive GUI

So I had a sequence classifier, but I wanted to make it more accessible. That’s where the graphical user interface came in. I used Tkinter to build a simple window with an input box, buttons and result labels.

A user can paste or type a sequence into the input. When they hit “Classify”, the app encodes the sequence, runs the resulting feature vector through the SVM, and displays the prediction and the model’s confidence.
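The wiring between the button and the model can be sketched roughly like this. The names `classify_text`, `format_result` and `build_window` are hypothetical, and `predict` stands in for whatever function wraps the encoder and the SVM; the real app's layout differs.

```python
from typing import Callable, Tuple

Predictor = Callable[[str], Tuple[str, float]]


def format_result(label: str, confidence: float) -> str:
    """Build the text shown in the result label."""
    return f"Prediction: {label} (confidence {confidence:.1%})"


def classify_text(text: str, predict: Predictor) -> str:
    """Clean the raw input and return the message to display."""
    sequence = "".join(text.split()).upper()  # drop whitespace and newlines
    if not sequence:
        return "Please enter a sequence."
    label, confidence = predict(sequence)
    return format_result(label, confidence)


def build_window(predict: Predictor):
    """Create the Tkinter window; the caller runs mainloop() on the result."""
    # Imported here so classify_text can be used and tested without a display.
    import tkinter as tk

    root = tk.Tk()
    root.title("Sequence Classifier")

    entry = tk.Text(root, height=6, width=60)
    entry.pack(padx=10, pady=10)
    result = tk.Label(root, text="")

    def on_classify():
        # Read everything in the text box and show the formatted prediction.
        result.config(text=classify_text(entry.get("1.0", tk.END), predict))

    tk.Button(root, text="Classify", command=on_classify).pack()
    result.pack(padx=10, pady=10)
    return root


# To launch the GUI with a real predictor:
#     build_window(my_predict).mainloop()
```

Keeping the classification logic in a plain function, separate from the widgets, makes it easy to test without opening a window.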

To improve the user experience, I also added functions to detect if the input was DNA, RNA or a protein sequence before classification. This provides helpful context on what type of sequence the user entered.
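One simple way to do this kind of detection is to compare the letters in the input against each alphabet. The sketch below is an assumption about how it could work, not the exact logic in the repo. Note that the check is inherently ambiguous: a string like "ACG" is valid DNA, RNA and protein, so the order of the tests decides the answer.

```python
DNA_LETTERS = set("ACGTN")      # N is commonly used for an unknown base
RNA_LETTERS = set("ACGUN")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def detect_sequence_type(sequence: str) -> str:
    """Guess whether a sequence is DNA, RNA or protein from its letters."""
    letters = set(sequence.upper())
    if not letters:
        return "unknown"
    # Check the smallest alphabets first, since nucleotide letters are
    # also valid amino acid codes.
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


kinds = [detect_sequence_type(s) for s in ("ACGT", "ACGU", "MKVLA", "ACGTU")]
```

A sequence mixing T and U, like the last example, matches none of the alphabets and is reported as unknown.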

Testing and Iterating

With the initial GUI built, I worked to refine the model and interface. I tested it on various sample sequences to identify any classification errors. When I found issues, I would debug the encoding, re-train the model on more data, or tweak the SVM hyperparameters.
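For the hyperparameter tweaking, a cross-validated grid search is one common approach. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data from `make_classification`; the grid values are illustrative guesses, not the settings I ended up with.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the encoded sequence features and labels.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Candidate settings to compare; every combination is tried.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
}

# 5-fold cross-validation scores each combination on held-out folds.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_  # combination with the best mean CV accuracy
best_score = search.best_score_    # that mean accuracy
```

After fitting, `search.best_estimator_` is a model retrained on all the data with the winning settings, ready to drop into the GUI.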

I also iterated on the GUI — adjusting the visual layout, adding sample inputs for users to try, and providing more debugging information on the classification process. These iterations improved robustness and usability.

Results

In the end, I had built an easy-to-use tool for sequence classification powered by machine learning. The SVM model can accurately categorize a diverse range of input sequences based on subtle patterns in the encoded features. And the graphical interface makes it accessible to anyone without coding experience.

Overall, this was a win of a project. Lots of fun.

