How to approach a text classification problem (part 1/3)

David Morcuende
6 min read · Feb 10, 2021


In this post I will show you how to approach a text classification problem using a dataset I was given to analyse.

I was provided with a .zip file, and the only instructions were: “there are 7 folders, inside each folder there are some files, the name of the folder represents the class”.

I will split the solution into 3 parts:

  • EDA (Exploratory data analysis): we will take a look into the files and the data.
  • Modeling: we will define some baseline models with scikit-learn and then try to improve them with TensorFlow.
  • Deploy: we will deploy the best model (local & AWS).

Some example text:

Some file names:

EDA

First of all we need to set a seed to make the process reproducible and import the needed classes to work with files and load the data.

The files are in a folder called dataset, in the same directory as this notebook.
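
A minimal sketch of what this setup might look like, assuming the class folders live in a dataset directory next to the notebook (the seed value and the DATA_DIR/classes names are my own):

import os
import random

import numpy as np

SEED = 42  # any fixed value works; it just makes the runs reproducible
random.seed(SEED)
np.random.seed(SEED)

DATA_DIR = "dataset"  # folder containing the 7 class folders
classes = sorted(os.listdir(DATA_DIR))
classes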

['exploration',
'headhunters',
'intelligence',
'logistics',
'politics',
'transportation',
'weapons']

As anticipated, there are 7 classes.

Reading text

We know the files contain text, but we don’t know their extension, so we are going to iterate over the folders and files, read the text, and save it in a dataframe with 2 columns: the text and the class it belongs to.
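
A sketch of how that loop might look, reusing the DATA_DIR and classes names from above; the UTF-8 / errors="ignore" reading strategy is an assumption on my part:

import pandas as pd

rows = []
for class_name in classes:
    class_dir = os.path.join(DATA_DIR, class_name)
    for file_name in os.listdir(class_dir):
        # read each file regardless of its extension and keep its class label
        with open(os.path.join(class_dir, file_name), encoding="utf-8", errors="ignore") as f:
            rows.append({"text": f.read(), "target": class_name})

df = pd.DataFrame(rows, columns=["text", "target"])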

Wall time: 5.52 s

Let’s check for null data that we have not been able to read.

df.shape
(3893, 2)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3893 entries, 0 to 3892
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   text    3893 non-null   object
 1   target  3893 non-null   object
dtypes: object(2)
memory usage: 61.0+ KB

df.isnull().sum()
text      0
target    0
dtype: int64

Missingno is a Python data visualization library that gives you a quick visual summary of the completeness (or lack thereof) of your dataset.

import missingno as msno
msno.matrix(df,color=(0.3,0.36,0.44))

We confirm that we have 3893 text files and that there were no errors when reading their content; at least there is some text in all of them.

Let’s see how these texts are distributed among the different classes.
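
One simple way to produce such plots is with pandas’ built-in plotting; this is only a sketch and not necessarily the code behind the original figures:

import matplotlib.pyplot as plt

counts = df["target"].value_counts()
counts.plot(kind="bar", title="Files per class")
plt.show()

# the same counts expressed as a percentage of all files
(counts / counts.sum() * 100).round(1)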


The “politics” class has the most files (609, representing 15.6% of all files) and “intelligence” has the fewest (465, representing 11.9% of all files).

We can see that there is not a great imbalance between the different classes.

Next we will look at how the characters and words are distributed within the files. For this task we will create a function with parameters, so that we can reuse it during the data cleaning process.
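
The original implementation is not shown here; below is a minimal sketch of what such a function might look like, inferring its signature from the later call print_char_word(df, "text", True), where the third argument presumably switches from characters to words (the plotting details are my own):

import matplotlib.pyplot as plt

def print_char_word(df, column, words=False):
    """Plot, for each class, the distribution of character (or word) counts in `column`."""
    lengths = df[column].str.split().str.len() if words else df[column].str.len()
    targets = sorted(df["target"].unique())
    fig, axes = plt.subplots(1, len(targets), figsize=(20, 3), sharey=True)
    for ax, target in zip(axes, targets):
        lengths[df["target"] == target].hist(ax=ax, bins=30)
        ax.set_title(target)
    plt.show()

print_char_word(df, "text")  # character counts per class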


As can be seen in the charts, there is a lot of variation between the files: some have a very large number of characters, but the vast majority are below 20,000.

print_char_word(df,"text",True)

The same thing happens with words: there are files with many words, but the vast majority are below 2,000.

In addition to the characters and words, we will check for special characters and stop words (empty words, without meaning), since they only add noise to the subsequent classification. For this, we will create a function that counts the number of special characters or stop words for each class (a sketch follows).
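
Again, the author’s implementation is not included; the sketch below is inferred from the later calls plot_char_freq("clean_text", string.punctuation, character=True, ...) and plot_char_freq("clean_text", STOPWORDS, character=False, ...), and the counting logic is my own assumption:

import string
from collections import Counter

import matplotlib.pyplot as plt

def plot_char_freq(column, items, character=True, title=""):
    """For each class, count how often the given characters (or stop words) appear in `column`."""
    targets = sorted(df["target"].unique())
    fig, axes = plt.subplots(1, len(targets), figsize=(20, 3), sharey=True)
    for ax, target in zip(axes, targets):
        counter = Counter()
        for text in df.loc[df["target"] == target, column]:
            tokens = list(text) if character else text.lower().split()
            counter.update(t for t in tokens if t in items)
        if counter:
            labels, values = zip(*counter.most_common(20))
            ax.bar(labels, values)
        ax.set_title(target)
    fig.suptitle(title)
    plt.show()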


As can be seen, there is a considerable number of special characters, especially hyphens “-” and periods “.”.

For the stop words we will use the stop word list from the nltk library, a natural language processing library for Python.
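
Downloading and loading the English stop-word list with nltk looks like this (choosing the English list is my assumption, since the texts appear to be in English):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))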

[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\video\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!

We plot one class on its own to analyse an example, since in the grid of 7 plots it is harder to appreciate the details.


As with the special characters, there is a very high number of stop words.

For better text processing we are going to lower-case everything, to normalize the words and make sure that two words are not treated as different just because one starts with a capital letter, and to delete a series of elements:

  • Urls
  • HTML markup
  • Punctuation symbols
  • Numbers
  • Stopwords
  • Spaces

In that order, to correctly delete everything.

Processing Text
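
A sketch of a cleaning function that applies the steps above in that order; the regular expressions are my own and may differ from the ones actually used:

import re
import string

def clean_text(text):
    text = text.lower()                                                 # lower-case everything
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                  # URLs
    text = re.sub(r"<.*?>", " ", text)                                  # HTML markup
    text = text.translate(str.maketrans("", "", string.punctuation))    # punctuation symbols
    text = re.sub(r"\d+", " ", text)                                    # numbers
    text = " ".join(w for w in text.split() if w not in STOPWORDS)      # stop words
    return re.sub(r"\s+", " ", text).strip()                            # extra spaces

df["clean_text"] = df["text"].apply(clean_text)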

plot_char_freq("clean_text", string.punctuation, character=True, title='Special characters')

plot_char_freq("clean_text", STOPWORDS, character=False, title='Stop words')

print_char_word(df, "clean_text", True)

As we can see, there are no longer punctuation symbols or stop words, and the number of words has dropped considerably, to around 500 words (although we still have texts with a word count far above the average).

After the cleaning, we check whether there are empty texts and convert them to null so we can drop them easily.

We also check for duplicates and remove them so there is no redundancy.

df = df.replace(r'^\s*$', np.nan, regex=True)

np.where(df.applymap(lambda x: x == ''))
(array([], dtype=int64), array([], dtype=int64))

df.isnull().sum()
text          4
target        0
clean_text    5
dtype: int64
duplicate = df[df.duplicated()]
duplicate
df = df.dropna()
df = df.drop_duplicates()

Having cleaned the data and removed the nulls and duplicates, we prepare the categories to be classified. For this, we change the “target” column to a categorical column so it can easily be converted to numeric categories.

df["target"] = df["target"].astype('category')
df["target_cat"] = df["target"].cat.codes
df.head()

In the next post we will explore models using the scikit-learn library to get some baseline models, build models with TensorFlow, and compare them to see which one works best.

There will also be an annex explaining what embeddings are.

Code (soon)

Contact

Buy me a coffee
