I bet you all used to have an mp3 player, and the first thing you did after buying it was copy your favorite music files onto it. You know, those files ending in “.MP3” or “.WAV” on your computer.
But what are these files, and what do they contain? Before explaining what a sound/audio file is, I need to explain what sound is! Remember those boring physics lessons at school? Words like “wavelength”, “frequency” and “intensity” might be coming back to you now, so let’s get started!
What is sound?
For humans, sound is something you can hear. Good. But for scientists, sound is much more.
In physics, a sound is a vibration that moves through a medium. It’s a mechanical wave, like the ripples produced when you skim a stone.
To carry this vibration, the medium must be a “medium with internal forces”.
This is a medium whose particles can move and push on one another: air and water are good examples. A vacuum is not this kind of medium because it contains nothing that can move or be moved.
So, we know that a sound is a vibration which moves through a medium. But how?
Think about the stone skimming again. First you see the ripples the stone makes. Then they flatten out over time, and the distance between peaks (the wavelength) increases.
You probably know the word “frequency”. Frequency is closely tied to wavelength: for a given wave speed, the longer the wavelength, the lower the frequency. It is defined as 1 divided by the duration of one sound cycle (the period), and its unit is the hertz (Hz). If the period of a sound is 0.25 seconds, its frequency is 4Hz because 1/0.25 = 4.
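To make the arithmetic concrete, here is a minimal Python sketch of the relationship between period, frequency and wavelength. The function names and the speed of sound in air (roughly 343 m/s at room temperature) are my own assumptions for illustration, not values from this article.

```python
# Assumed approximate speed of sound in air at about 20 °C, in metres per second.
SPEED_OF_SOUND_AIR = 343.0


def frequency_from_period(period_s: float) -> float:
    """Frequency in hertz is 1 divided by the duration of one cycle in seconds."""
    return 1.0 / period_s


def wavelength_from_frequency(frequency_hz: float,
                              speed: float = SPEED_OF_SOUND_AIR) -> float:
    """Wavelength in metres is the wave speed divided by the frequency."""
    return speed / frequency_hz


print(frequency_from_period(0.25))       # 4.0, the example from the text
print(wavelength_from_frequency(440.0))  # wavelength of a 440 Hz tone in air
```

Note how a higher frequency gives a shorter wavelength as long as the speed stays the same.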
Have you ever wondered why a motorbike sounds higher-pitched when it is approaching you and lower-pitched when it is moving away? This is because the frequency of the engine sound that reaches your ears is raised in the first case and lowered in the second. This is known as the Doppler effect.
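For the curious, this pitch shift (the Doppler effect) can be estimated with a short formula. The sketch below assumes a listener standing still, a source moving straight toward or away from them, and a speed of sound of about 343 m/s. The function name and the example numbers are made up for illustration.

```python
def observed_frequency(source_hz: float,
                       source_speed: float,
                       sound_speed: float = 343.0) -> float:
    """Frequency heard by a stationary listener.

    source_speed is in metres per second: positive when the source is
    approaching the listener, negative when it is moving away.
    """
    return source_hz * sound_speed / (sound_speed - source_speed)


engine_hz = 100.0  # hypothetical engine tone
print(observed_frequency(engine_hz, 30.0))   # approaching: higher than 100 Hz
print(observed_frequency(engine_hz, -30.0))  # moving away: lower than 100 Hz
```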
So, wavelength changes the wave, and therefore the sound. But there are other things that can change a sound, such as “intensity” or “amplitude”. Back to our stone skimming.
After a while, the human eye cannot see the wave anymore. This is because the intensity of the wave decreases over time. It’s like a car on a straight road: once you stop accelerating, it eventually comes to a stop. The same goes for a wave. Its intensity naturally decreases over time until the ripples from the stone can no longer be seen. In the same way, a sound wave eventually can no longer be heard.
Now you know two properties of sound. There are others, and if you want to learn more, I encourage you to read /LINKS/.
My next question is: what is the difference between a treble sound and a bass sound? To keep it simple, a bass sound has a low frequency and a treble sound has a high frequency. This property is called “pitch”. Humans can hear sounds between 20Hz and 20 000Hz. But dogs can hear sounds from 64Hz to 44 000Hz and cats from 55Hz to 77 000Hz!
Back to the mp3 player and mp3/wav files. These files contain sound data, and we can picture the frequencies and intensities changing as the audio track plays. Let’s look at this in more detail.
What’s an audio file?
At the beginning, I mentioned “.MP3” and “.WAV” files. Let me explain the difference between these two audio file types.
First, let’s have a look at “.RAW” files.
“.RAW” files, created in the 20th century, are audio files which contain audio data “as-we-hear”. The data is uncompressed and raw (hence the name). To play audio data from these files, the user must specify several properties: bits per sample, sample rate and number of channels (mono vs. stereo). It’s not very convenient for everyday users.
Years later, in 1991, the “.WAV” file format was released by Microsoft and IBM. Its full name is “Waveform Audio File Format”. With uncompressed data, the file size is around 10 megabytes per minute of CD-quality sound, which was huge in the 90’s. This file contains audio data, but at the top of the file there are some “headers” that tell the music player software all the properties: data type, bits per sample, sample rate, number of channels (mono vs. stereo), etc.
With this, the end user does not need to specify them as they must with RAW files.
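If you want to see those headers for yourself, here is a small sketch using the `wave` module from the Python standard library. To stay self-contained it first builds a tiny WAV file in memory (one second of silence), then reads the properties back the way a player would. The chosen format (mono, 16 bits, 8000Hz) is an arbitrary assumption for the example.

```python
import io
import wave

# Build a tiny WAV file in memory: 1 second of silence, mono, 16-bit, 8000 Hz.
wav_file = io.BytesIO()
with wave.open(wav_file, "wb") as writer:
    writer.setnchannels(1)      # mono
    writer.setsampwidth(2)      # 2 bytes per sample = 16 bits
    writer.setframerate(8000)   # samples per second
    writer.writeframes(b"\x00\x00" * 8000)

# The file starts with the "RIFF" and "WAVE" markers, followed by the headers.
raw_bytes = wav_file.getvalue()
print(raw_bytes[:4], raw_bytes[8:12])

# Read the headers back: nothing has to be specified by hand.
wav_file.seek(0)
with wave.open(wav_file, "rb") as reader:
    channels = reader.getnchannels()
    bits_per_sample = reader.getsampwidth() * 8
    sample_rate = reader.getframerate()
    duration_s = reader.getnframes() / sample_rate

print(channels, bits_per_sample, sample_rate, duration_s)
```

With a “.RAW” file, none of this information would be stored, so the three values would have to be supplied by the user.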
If we think about it for a minute, we can say: « Okay, so if I can specify the data type in a “.WAV” file, can I put other data formats in a “.WAV” file? » The answer is: « YES ». You are allowed to put mp3 audio data in a “.WAV” file! The only thing you need to take care of is to set the file headers so that they tell the music player software that the file contains mp3 data.
At the beginning of the 21st century, with the advent of the Internet and file sharing, people looked for other data formats to reduce file size. Now it’s time to talk about compression!
“MP3” stands for “MPEG Audio Layer 3”, a scheme for compressing audio signals. An MP3 audio file contains headers, like a “.WAV” file, but the audio data is “compressed”.
“MPEG Audio Layer 3” compression detects sound data that is inaudible to humans (you know, sound below 20Hz or above 20 000Hz) and removes it from the file. We now have a file about one-tenth the size of an equivalent “.WAV” file (e.g. 1MB per minute of sound). Today it’s one of the most widely used audio data formats.
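Where do these sizes come from? Here is a rough back-of-the-envelope calculation in Python. It assumes CD-quality uncompressed audio (44 100 samples per second, 2 channels, 16 bits per sample) and an MP3 encoded at 128 kbit/s. Both settings are common choices that I picked for illustration, and headers are ignored.

```python
def pcm_bytes_per_minute(sample_rate: int = 44_100,
                         channels: int = 2,
                         bits_per_sample: int = 16) -> int:
    """Size of one minute of uncompressed audio data, headers excluded."""
    return sample_rate * channels * (bits_per_sample // 8) * 60


def compressed_bytes_per_minute(bitrate_kbps: int = 128) -> int:
    """Size of one minute of audio encoded at a constant bitrate."""
    return bitrate_kbps * 1000 // 8 * 60


wav_size = pcm_bytes_per_minute()         # 10_584_000 bytes, about 10 MB
mp3_size = compressed_bytes_per_minute()  # 960_000 bytes, about 1 MB
print(wav_size, mp3_size, wav_size / mp3_size)
```

Under these assumptions, the ratio comes out at roughly 11 to 1, which matches the “one-tenth” rule of thumb.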
When you press “play” on your mp3 player, the sound data becomes audible. Your mp3 player decodes the sound file and makes the headphones or speakers vibrate. A speaker can emit many different sounds, and the difference between a treble sound and a bass sound lies in the vibration the speaker produces. So the signals sent to the speaker are not the same for a treble sound and a bass sound. Let’s check our file data!
How to deal with audio files in programming?
I want to talk about audio in programming, not in terms of playing sounds but of recording them. At SAP Conversational AI, we can use audio files or sound streaming to perform Natural Language Processing on your voice. When you press our streaming button, we have to record your voice, transmit the data and then analyze your speech to give you back the result.
Nowadays, almost every device has a built-in microphone (smartphones, tablets, computers, and even your car!). The challenge is to manage the connection between hardware and software. For many of these devices, manufacturers provide software facilities (called libraries, frameworks or SDKs) for controlling the hardware and receiving data from it.
To explain how a developer can work with audio data, let’s take the example of a computer. Have you ever seen a pop-up asking for permission to use your microphone while browsing the Internet?
By clicking on the “Allow” button, you authorize the website to capture data from the built-in microphone. From that moment, your voice and the ambient sounds are temporarily saved in an array, until you click on “Stop record” or until the website stops recording. The developer now has an array containing a lot of floating-point numbers.
This array is called a “buffer”. If you look at the values in this array, you can see that they all lie between -1 and 1. By convention, each float is one audio sample, and its value is the amplitude of the audio signal at that instant.
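To get a feel for what such a buffer contains, here is a small Python sketch that fills one with a synthetic tone instead of a real microphone recording. The function name, the 440Hz tone and the 44 100Hz sample rate are assumptions chosen for the example. A real recording would simply contain less regular values.

```python
import math


def sine_buffer(frequency_hz: float,
                duration_s: float,
                sample_rate: int = 44_100,
                amplitude: float = 0.5) -> list[float]:
    """Return float samples in [-amplitude, amplitude] for a pure tone."""
    sample_count = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * i / sample_rate)
        for i in range(sample_count)
    ]


buffer = sine_buffer(440.0, 0.01)  # 10 milliseconds of a 440 Hz tone
print(len(buffer))                 # number of samples in the buffer
print(buffer[:4])                  # the first few amplitudes
print(min(buffer), max(buffer))    # all values stay between -1 and 1
```

A higher `frequency_hz` gives a more treble tone, and a larger `amplitude` gives a louder one.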
When recording stops, your buffer is filled. To finish the process and create a real “.WAV” file, you just need to set the “headers” so that they match the format of the recorded data.
This way of representing audio data is called PCM (Pulse-Code Modulation). To learn more about PCM, check out the Wikipedia page: https://en.wikipedia.org/wiki/Pulse-code_modulation
In reality, the data part of the file is not a human-readable array but a sequence of these numbers converted to binary format.
More information about binary data: https://www.nayuki.io/page/what-are-binary-and-text-files
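Here is a minimal sketch of that last step in Python: turning a buffer of floats into 16-bit binary PCM data and letting the standard `wave` module write the matching headers. The helper names and the choice of 16-bit mono at 44 100Hz are assumptions for illustration. A real recorder would use whatever format its hardware delivers.

```python
import io
import struct
import wave


def floats_to_pcm16(samples: list[float]) -> bytes:
    """Convert floats in [-1, 1] to little-endian signed 16-bit integers."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]  # guard against overflow
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def write_wav(target, samples: list[float],
              sample_rate: int = 44_100, channels: int = 1) -> None:
    """Write the samples as a PCM WAV file; the wave module adds the headers."""
    with wave.open(target, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(2)  # 2 bytes per sample = 16 bits
        writer.setframerate(sample_rate)
        writer.writeframes(floats_to_pcm16(samples))


recorded = [0.0, 0.5, 1.0, -1.0]  # stand-in for a recorded buffer
in_memory_file = io.BytesIO()     # a file path such as "voice.wav" works too
write_wav(in_memory_file, recorded)
print(len(in_memory_file.getvalue()))  # headers plus 2 bytes per sample
```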
And that’s it 🙂
All these processes, and the values in the array, depend on the hardware and the audio specifications. Bitrate, bits per sample, number of channels and other sound properties change the headers and the data buffer.
Many apps work with this kind of data. Audio players like iTunes and VLC, audio editors like Audacity or Adobe Audition, and generally any software that plays or records audio files can read PCM data.
By the way, manipulating audio data in code is easier than most developers think! At SAP Conversational AI, we now use audio files and speech streaming to perform NLP on your voice.
SAP Conversational AI can now hear you 🙂