In the previous article we learned that sound data has features such as intensity, frequency, time period, phase angle, and angular velocity, and we saw how sound data is formed: sound travels as compressions of air molecules, and changes in the pressure of the medium cause changes in the sound produced.
Here we are going to classify different voices on the basis of their labels and sound data using a deep convolutional network.
Sound classification is used in a variety of applications, where sounds are grouped by category. I already have one article on voice data analysis, Link, where we touched on various basic parameters like the spectrogram, timbre, etc., but here we will discuss classifying our sound data.
These are the steps we follow for sound classification:
Loading sound data using the librosa library (a minimal sketch follows this list),
Converting sound data into numerical spectrogram vectors,
Building a deep neural network,
Predicting the label of the sound data.
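The code further down loads data with tensorflow_datasets, but since the first step names librosa, here is a minimal sketch of loading a clip with it; the file name voice_sample.wav is a hypothetical placeholder, not from the original post:

import librosa

# Load a clip resampled to 22,050 Hz, the rate used throughout this article.
# "voice_sample.wav" is a hypothetical file name.
samples, sampling_rate = librosa.load("voice_sample.wav", sr=22050)
print(samples.shape, sampling_rate)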
Sound classification works much like other classification tasks in machine learning, such as image classification or general feature classification; here, too, we build a CNN architecture.
In the above flow chart we see that when we get raw data in the form of an MP3 or another format, we use machine learning libraries to extract spectrograms from these voice data files, and then train a machine learning model on the spectrogram data.
Before proceeding further with the deep learning model we will import all the useful machine learning and deep learning libraries.
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers import Input, Lambda, Conv2D, BatchNormalization
from tensorflow.keras.layers import Activation, MaxPool2D, Flatten, Dropout, Dense
from IPython.display import Audio
from matplotlib import pyplot as plt
from tqdm import tqdm
After importing all the useful libraries we will load the complete dataset by pointing at the folder containing all the files.
# Note: tfds.load() normally takes a registered dataset name; the raw string
# (r"...") keeps the Windows backslashes from being read as escape sequences.
dataset = tfds.load(r"C:\Users\Shubham\Desktop\Topcoder challenges\torque prediction\training\training")
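The loader returns a dictionary of splits, and the training loop further down iterates over a variable named train. Here is a minimal sketch, assuming the dataset exposes a standard "train" split:

# Assumption: the loaded dataset has a "train" split; adjust the key if your
# dataset uses different split names.
train = dataset["train"]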
We have already seen the data visualization, so we will move directly to data processing. Here the data arrives as sound, which we have to convert into numerical form, because our computations are done on numbers. We should also scale the numerical features into a common range, for example with log scaling.
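The preprocess() function below calls a spectrogram() helper that the original post does not define. Here is a minimal sketch using tf.signal.stft; the frame parameters are assumptions chosen so the output matches the (129, 212) input shape the model expects later, not the author's original values:

def spectrogram(audio, frame_length=256, frame_step=521):
    # Assumed helper: short-time Fourier transform -> log-magnitude spectrogram.
    # frame_length=256 yields 129 frequency bins; frame_step=521 yields 212
    # frames for a 5-second chunk at 22,050 Hz (both values are assumptions).
    audio = tf.cast(audio, tf.float32)
    stft = tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step)
    magnitude = tf.abs(stft)             # shape: (frames, bins)
    magnitude = tf.transpose(magnitude)  # shape: (bins, frames) = (129, 212)
    return tf.math.log(magnitude + 1e-6) # log scaling, as discussed above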
sr = 22050   # sampling rate (samples per second)
chunk = 5    # chunk length in seconds

def preprocess(ex):
    audio = ex.get("audio")
    label = ex.get("label")
    x_batch, y_batch = None, None
    # Cut the clip into six 5-second chunks and stack their spectrograms
    # (and labels) into a single batch.
    for i in range(0, 6):
        start = i * chunk * sr
        end = (i + 1) * chunk * sr
        audio_chunk = audio[start:end]
        audio_spec = spectrogram(audio_chunk)
        audio_spec = tf.expand_dims(audio_spec, axis=0)
        current_label = tf.expand_dims(label, axis=0)
        x_batch = audio_spec if x_batch is None \
            else tf.concat([x_batch, audio_spec], axis=0)
        y_batch = current_label if y_batch is None \
            else tf.concat([y_batch, current_label], axis=0)
    return x_batch, y_batch
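A quick sanity check on a single example (the expected shapes assume the spectrogram parameters sketched above):

ex = next(iter(train))
x_batch, y_batch = preprocess(ex)
print(x_batch.shape, y_batch.shape)  # expected: (6, 129, 212) and (6,)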
After extracting numerical features from the data we will split it into training and validation sets and shape the batches so they fit our deep learning model.
x_train, y_train = None, None
# Accumulate the preprocessed chunks from every training example.
for ex in tqdm(iter(train)):
    x_batch, y_batch = preprocess(ex)
    x_train = x_batch if x_train is None \
        else tf.concat([x_train, x_batch], axis=0)
    y_train = y_batch if y_train is None \
        else tf.concat([y_train, y_batch], axis=0)

# Shuffle the 768 chunks, then hold out the first 300 for validation.
indices = tf.random.shuffle(list(range(0, 768)))
x_train = tf.gather(x_train, indices)
y_train = tf.gather(y_train, indices)
n_val = 300
x_valid = x_train[:n_val, ...]
y_valid = y_train[:n_val, ...]
x_train = x_train[n_val:, ...]
y_train = y_train[n_val:, ...]
Here we have completed the data processing and data splitting tasks. Now we will move on to model creation and build a deep convolutional network for the classification, as illustrated in the diagram below.
input_ = Input(shape=(129, 212))
# Add a channel dimension so Conv2D receives (129, 212, 1).
x = Lambda(lambda t: tf.expand_dims(t, axis=-1))(input_)
# Four convolution blocks with 32, 64, 128, and 256 filters.
for i in range(0, 4):
    num_filters = 2 ** (5 + i)
    x = Conv2D(num_filters, 3)(x)
    x = BatchNormalization()(x)
    x = Activation("tanh")(x)
    x = MaxPool2D(2)(x)
x = Flatten()(x)
x = Dropout(0.4)(x)
x = Dense(128, activation="relu")(x)
x = Dropout(0.4)(x)
x = Dense(1, activation="sigmoid")(x)
model = tf.keras.models.Model(input_, x)
Above we can see that each block starts with a convolution layer, followed by batch normalization, which stabilizes training and adds a mild regularizing effect that helps against overfitting. Next comes the activation function, which acts as a threshold on the features, and finally the max pooling layer scans the feature maps and downsizes them.
model.compile(
    loss="binary_crossentropy",
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-6),
    metrics=["accuracy"]
)
model.summary()
Now, after building and compiling the model, we will feed in our training features and labels and train for a number of epochs with a given batch size.
# Capturing the returned History object lets us plot the loss curves below.
history = model.fit(x_train, y_train, validation_data=(x_valid, y_valid),
                    batch_size=12, epochs=500, verbose=False)
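Since matplotlib was imported at the start but never used, here is a short sketch of plotting the training and validation loss from the History object captured above:

plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("binary cross-entropy")
plt.legend()
plt.show()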
After training our model we will test it on some unseen data points.
# "ex" is assumed to be an example drawn from a held-out test split.
x_test, y_test = preprocess(ex)
preds = model.predict(x_test)
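The sigmoid output is a probability per chunk. A minimal sketch of turning the predictions into class labels and checking them against the true labels (variable names as above):

# Threshold the probabilities at 0.5 to obtain binary class labels.
pred_labels = (preds.flatten() > 0.5).astype(int)
accuracy = (pred_labels == y_test.numpy().flatten()).mean()
print(f"Chunk-level accuracy: {accuracy:.3f}")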