ANOVA and Regression

To find out whether the changes in the audio features over time are statistically significant, we performed ANOVAs and linear regressions.

A one-way analysis of variance (ANOVA) treats the decade as a categorical variable and tells us whether the differences observed in the sample (the Top 20 US Billboard songs over the past 50 years) can be generalised as population differences. In other words, it indicates whether a change is statistically significant or plausibly just due to random variation (Korstanje, 2019). However, the ANOVA cannot tell us the direction of the difference (whether it is an increase or a decrease).
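As a rough illustration of what the ANOVA computes, the sketch below calculates the F statistic by hand for three decades. The function name `one_way_anova_f` and the danceability scores are hypothetical and chosen only to keep the arithmetic easy to follow; this is not the code used for the analysis on this page. Turning the F statistic into a p-value requires the F distribution, which a statistics library such as SciPy (`scipy.stats.f_oneway`) provides.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups (decades)
    n = len(all_values)    # total number of observations (songs)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical danceability scores, three songs per decade (not real data)
toy_decades = {
    "1970s": [0.50, 0.55, 0.60],
    "1990s": [0.60, 0.65, 0.70],
    "2010s": [0.70, 0.75, 0.80],
}

f_stat, df_b, df_w = one_way_anova_f(list(toy_decades.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value means the decade means differ by more than the spread within each decade would suggest. Note that the statistic is the same whichever way the means are ordered, which is why the ANOVA alone cannot show the direction of the change.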

The linear regression treats time as a continuous variable. This allows us to show whether there is a trend in how the audio features have changed through time. The R-squared value (also known as the coefficient of determination) is a statistical measure of how close the data are to the regression line. More specifically, it is the proportion of variance in the dependent variable (in our case the audio feature) that can be explained by the independent variable (time) (Glen, 2020). The value ranges from 0 to 1; that is, between 0% and 100% of the variation in the audio feature can be explained by time. The R-squared value is useful because it indicates how well future observations are likely to be predicted by the model (Glen, 2020). In our case, the higher the value, the more closely we would expect a future hit song's audio feature to follow the regression line. It is not, however, a probability.
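To make the definition of R-squared concrete, here is a minimal sketch of a least-squares line fit written in plain Python. The function name `fit_line` and the loudness values are hypothetical and are not taken from the dataset; the actual analysis may have used a library routine instead.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x, plus R-squared."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variation (left unexplained by the line) vs total variation
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical average loudness (dB) per year, for illustration only
years = [1970, 1980, 1990, 2000, 2010, 2020]
loudness = [-12.0, -11.0, -9.0, -8.5, -6.0, -5.0]

slope, intercept, r_squared = fit_line(years, loudness)
print(f"slope = {slope:.3f} dB per year, R-squared = {r_squared:.3f}")
```

The sign of the slope gives the direction of the trend, while R-squared gives the share of the variation in the audio feature that the line accounts for. A feature can therefore have a clear, statistically significant trend and still have a small R-squared if individual songs vary widely around that trend.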

Three audio features (time signature, acousticness and liveness) have been omitted from this analysis. Time signature is not measured on a continuous scale, and acousticness and liveness are vaguely defined and measured, which gave them negligible R-squared values and left little of significance to discuss.

Danceability

The ANOVA between danceability and the decades shows a statistically significant difference (p < 0.05), which the graph confirms is an increase over time. It shows that popular songs have become more ‘danceable’ through time.

The R-squared value is 0.0363, meaning that time explains only about 3.6% of the variance in danceability. Time is therefore far from the only factor behind songs becoming more danceable.

Tempo

As shown in the graph, the tempo of songs increases over time. However, the increase is only roughly 1 beat per minute from 1970 to 2020, which is a very small change in practice.

The R-squared value is 0.0002, confirming that the tempo of hit songs has barely changed through the years. The ANOVA also shows no statistically significant difference in tempo between decades (p > 0.05).

Length

From 1970 to 2020, the length of hit songs has decreased by 17,500 milliseconds on average. While that may seem like a lot, it is only 17.5 seconds (0.29 minutes). Nevertheless, the ANOVA confirms that there is a statistically significant difference over time (p < 0.01).

The R-squared value is 0.014, meaning that time explains only about 1.4% of the variance in song length.
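The unit conversion for the change in length can be checked with a few lines of Python. The variable names are illustrative only.

```python
# Average decrease in the length of hit songs, in milliseconds
decrease_ms = 17_500

decrease_seconds = decrease_ms / 1000       # milliseconds to seconds
decrease_minutes = decrease_seconds / 60    # seconds to minutes

print(f"{decrease_seconds} seconds = {decrease_minutes:.2f} minutes")
```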

Instrumentalness

From the graph, we can see that instrumentalness has decreased throughout the years. The closer the value is to 1.0, the more likely the song has no vocal content, so a decrease means that vocal content has increased over time. This change is statistically significant according to the ANOVA (p < 0.01).

The R-squared value is 0.012, meaning that time explains only about 1.2% of the variance in instrumentalness; most of the variation between songs is due to other factors.

Energy

The intensity and activity (energy) of hit songs have increased by around 0.2 throughout the years, which is a statistically significant difference according to the ANOVA (p < 0.01).

The R-squared value is 0.0599, meaning that time explains only about 6% of the variance in energy. The period in which a hit song was recorded therefore tells us little about how energetic it is.

Loudness

As years go by, the average loudness of a track (measured in decibels from -60 to 0 dB) increases by around 7 dB, which is a statistically significant change (p < 0.01).

The R-squared value is 0.3561, meaning that about 36% of the variance in the loudness of a song can be explained by the passing of time.

Linear Regression showing the evolution of danceability through time

Linear Regression showing the evolution of tempo through time

Linear Regression showing the evolution of song length through time

Linear Regression showing the evolution of instrumentalness through time

Linear Regression showing the evolution of energy through time

Linear Regression showing the evolution of loudness through time

Speechiness

The graph shows that, as years go by, songs contain more spoken words and fewer long instrumental passages, a statistically significant difference (p < 0.01).

The R-squared value is 0.0886, meaning that time explains only about 9% of the variance in speechiness.

Linear Regression showing the evolution of speechiness through time
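To compare the features side by side, the R-squared values reported in the sections above can be expressed as the percentage of variance explained by time. The snippet below is only a convenience sketch; the dictionary name `r_squared_by_feature` is an assumption, and the numbers are copied from the text.

```python
# R-squared values as reported for each audio feature on this page
r_squared_by_feature = {
    "danceability": 0.0363,
    "tempo": 0.0002,
    "length": 0.014,
    "instrumentalness": 0.012,
    "energy": 0.0599,
    "loudness": 0.3561,
    "speechiness": 0.0886,
}

# Sort from the most to the least time-dependent feature
ranked = sorted(r_squared_by_feature.items(), key=lambda item: item[1], reverse=True)

for feature, r2 in ranked:
    print(f"{feature:<17} {r2:.2%} of variance explained by time")

most_explained = ranked[0][0]
```

Loudness stands out as the only feature for which time explains a substantial share of the variance; for every other feature the share is below 10%.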

The code used to make these graphs can be found here

The code used to calculate the ANOVAs can be found here
