How to approach a text classification problem (part 1/3)
In this post I will show you how to approach a text classification problem using a dataset I was given to analyse.
I was provided with a .zip file and the only instructions were: “there are 7 folders, inside each folder there are some files, the name of the folder represents the class”.
I will split the solution into 3 parts:
- EDA (Exploratory data analysis): we will take a look into the files and the data.
- Modeling: we will define some baseline models with scikit-learn and then try to improve them with TensorFlow.
- Deploy: we will deploy the best model (locally & on AWS).
Some text examples:
Some file names:
EDA
First of all, we need to set a seed to make the process reproducible and import the classes needed to work with files and load the data.
The files are in a folder called dataset, in the same directory as this notebook.
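A minimal setup sketch, assuming the data lives in a dataset/ folder next to the notebook (the seed value is illustrative):

import os
import random
import numpy as np
import pandas as pd

# Fix the seeds so the process is reproducible
SEED = 42  # illustrative value
random.seed(SEED)
np.random.seed(SEED)

# The class folders live inside ./dataset
DATASET_DIR = "dataset"
sorted(os.listdir(DATASET_DIR))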
['exploration',
'headhunters',
'intelligence',
'logistics',
'politics',
'transportation',
'weapons']
As anticipated in the instructions, there are 7 classes.
Reading text
We know the files contain text, but we don’t know their extension, so we will iterate over the folders and files, read the text, and save it in a DataFrame with 2 columns: the text and the class it belongs to.
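A sketch of the reading loop (the encoding and error handling here are assumptions); the %%time cell magic is what produced the wall time shown below:

%%time
texts, targets = [], []
for folder in sorted(os.listdir(DATASET_DIR)):
    folder_path = os.path.join(DATASET_DIR, folder)
    for file_name in os.listdir(folder_path):
        with open(os.path.join(folder_path, file_name), encoding="utf-8", errors="ignore") as f:
            texts.append(f.read())
        targets.append(folder)  # the folder name is the class
df = pd.DataFrame({"text": texts, "target": targets})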
Wall time: 5.52 s
Let’s check for null data that we have not been able to read
df.shape
(3893, 2)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3893 entries, 0 to 3892
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   text    3893 non-null   object
 1   target  3893 non-null   object
dtypes: object(2)
memory usage: 61.0+ KB

df.isnull().sum()
text      0
target    0
dtype: int64
Missingno is a Python data visualization library that gives you a quick visual summary of the completeness (or lack thereof) of your dataset.
import missingno as msno
msno.matrix(df, color=(0.3, 0.36, 0.44))
<AxesSubplot:>
We confirm that we have 3893 text files and that no errors occurred while reading their content; at least there is some text in each of them.
Let’s see how these texts are distributed among the different classes
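A quick way to get the counts and the plot (the styling is illustrative):

counts = df["target"].value_counts()
print(counts)
print((counts / counts.sum() * 100).round(1))  # percentage per class
counts.plot(kind="bar", color=(0.3, 0.36, 0.44))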
The “politics” class has the most files (609, representing 15.6% of the total), while “intelligence” has the fewest (465, representing 11.9%).
We can see that there is no great imbalance between the classes.
Next we will see how the characters and words are distributed within the files. For this task we will create a function with parameters, so we can reuse it later in the data cleaning process.
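The full implementation is not shown here; a minimal sketch of what print_char_word could look like, with the signature inferred from the calls below (the third argument is assumed to toggle plotting):

import matplotlib.pyplot as plt

def print_char_word(data, column, plot=False):
    # Characters and words per file
    n_chars = data[column].str.len()
    n_words = data[column].str.split().str.len()
    print(n_chars.describe())
    print(n_words.describe())
    if plot:
        fig, axes = plt.subplots(1, 2, figsize=(12, 4))
        axes[0].hist(n_chars, bins=50)
        axes[0].set_title("Characters per file")
        axes[1].hist(n_words, bins=50)
        axes[1].set_title("Words per file")
        plt.show()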
As can be seen in the plots, the files are quite uneven: some have many characters, but the vast majority are below 20,000.
print_char_word(df,"text",True)
The same happens with words: there are files with many words, but the vast majority are below 2,000.
In addition to the characters and words, we will check for special characters and stop words (empty words without meaning), since they only add noise to the subsequent classification. For this we will create a function that, for each class, counts the number of special characters or stop words.
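A sketch of what this plot_char_freq helper might look like, inferred from how it is called later in the post (it takes a column name, a collection of characters or words, a flag switching between character and word counting, and a title; the details are assumptions):

from collections import Counter

def plot_char_freq(column, items, character=True, title=""):
    # Uses the global df, matching how the function is called later in the post
    classes = sorted(df["target"].unique())
    fig, axes = plt.subplots(len(classes), 1, figsize=(12, 4 * len(classes)))
    for ax, label in zip(axes, classes):
        counts = Counter()
        for text in df.loc[df["target"] == label, column]:
            tokens = list(text) if character else text.split()
            counts.update(t for t in tokens if t in items)
        items_sorted = sorted(items)
        ax.bar(range(len(items_sorted)), [counts[i] for i in items_sorted])
        ax.set_title(f"{title} - {label}")
    plt.tight_layout()
    plt.show()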
As can be seen, there is a considerable number of special characters, especially hyphens (“-”) and periods (“.”).
For the stop words we will use the stop word list from the nltk library, a natural language processing library for Python.
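Loading the stop-word list, assuming the English set (the download log below comes from nltk.download):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))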
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\video\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
We plot one class on its own so we can analyze an example, since in the grid of all 7 it is harder to appreciate the details.
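One possible way to plot a single class on its own (the choice of class here is purely illustrative):

# Count the stop words for one class and plot them as a bar chart
one_class = df.loc[df["target"] == "politics", "text"]
counts = Counter()
for text in one_class:
    counts.update(w for w in text.lower().split() if w in STOPWORDS)
stop_list = sorted(STOPWORDS)
plt.figure(figsize=(20, 5))
plt.bar(stop_list, [counts[w] for w in stop_list])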
<BarContainer object of 179 artists>
As with the special characters, there is a very high number of stop words.
For better text processing we are going to lower-case everything, so that the words are normalized and two words are not treated as different merely because one starts with an upper-case letter, and we will delete a series of elements (as sketched in the next section):
- Urls
- HTML markup
- Punctuation symbols
- Numbers
- Stopwords
- Spaces
In that order, to correctly delete everything.
Processing Text
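A sketch of a cleaning function that applies those steps in order (the regular expressions and the clean_text column name are assumptions, but they match the calls that follow):

import re
import string

def clean(text):
    text = text.lower()                                                # lower case
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                 # URLs
    text = re.sub(r"<.*?>", " ", text)                                 # HTML markup
    text = text.translate(str.maketrans("", "", string.punctuation))   # punctuation symbols
    text = re.sub(r"\d+", " ", text)                                   # numbers
    text = " ".join(w for w in text.split() if w not in STOPWORDS)     # stop words
    return re.sub(r"\s+", " ", text).strip()                           # extra spaces

df["clean_text"] = df["text"].apply(clean)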
plot_char_freq("clean_text", string.punctuation, character=True, title='Caracteres especiales')
plot_char_freq("clean_text", STOPWORDS, character=False, title='Stop words')
print_char_word(df,"clean_text",True)
As we can see, there are no longer punctuation symbols or stop words, and the number of words has dropped considerably, on the order of 500 words (although we still have texts with a word count far above the average).
After the cleaning, we check whether there are empty texts and convert them to null so we can remove them easily.
We also check for duplicates and remove them so there is no redundancy.
df = df.replace(r'^\s*$', np.nan, regex=True)

np.where(df.applymap(lambda x: x == ''))
(array([], dtype=int64), array([], dtype=int64))

df.isnull().sum()
text          4
target        0
clean_text    5
dtype: int64

duplicate = df[df.duplicated()]
duplicate

df = df.dropna()
df = df.drop_duplicates()
Having cleaned the data and removed the nulls and duplicates, we will prepare the categories to be classified. To do this, we change the “target” column to a categorical column so that it can easily be converted to numeric codes.
df["target"] = df["target"].astype('category')
df["target_cat"] = df["target"].cat.codesdf.head()
In the next post we will explore models with the scikit-learn library to get some baseline models, then build models with TensorFlow and compare them to see which one works best.
There will also be an annex explaining what embeddings are.