Finding Identical Lyrics using Modern NLP

Classifying similarly dubbed songs across languages automatically

Pavan B Govindaraju
5 min read · Feb 4, 2024

At the risk of making this blog look more like my other one, I’m going to talk about movie lyrics in this post. One thing that has always bothered me is how lazily songs are translated between Telugu and Tamil, to the point where the two versions are almost identical word for word. This might also be why some people prefer to listen to the originals, as the translated versions never really fit the music. I had noticed a few examples myself and wanted to see if I could automate the process of finding them.

Why are you so awesome? (Source: Wikimedia)

This turned out to be an arduous task as there is no definitive source of lyrics for both languages. More importantly, performing similarity searches across languages is a task in itself, although recent improvements have made it feasible.

First, I aggregated the top A. R. Rahman songs from the 90s and early 2000s using GPT-3.

This gave me a reasonably good table which required further corrections. One such entry was “Margazhi Thingal”, which had to be removed because the movie was such a disaster that a dubbed version was never made. Another faulty entry was an Ilaiyaraja composition “Thendral Vanthu”, which is a classic in itself, but not in accordance with the prompt specified. The output which served as the eventual starting point is as follows:

Table 1: Summary of the top 20 songs of A R Rahman from the 90s along with their corresponding versions in Telugu and Tamil

Next, the lyrics for all these songs had to be collected. Unfortunately, neither GPT-3.5 nor GPT-4 was able to provide accurate lyrics from memory, as the lyrics are copyrighted material. The feasible approach for now is to use the repository or dig around on search engines to find the most appropriate source for each song.

So the approach used was:
Lyrics -> Transliteration (using AI4Bharat Transliteration [1]) -> Vector Embedding (using LaBSE [2]) -> Cosine Similarity
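
The chain above can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the results in this post: `transliterate` and `embed` are hypothetical callables standing in for the AI4Bharat transliteration step and the LaBSE model, and `load_labse_embedder` assumes the `sentence-transformers` package with its `sentence-transformers/LaBSE` checkpoint, which may differ from how the actual tool loads the model.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lyric_similarity(
    lyrics_a: str,
    lyrics_b: str,
    transliterate: Callable[[str], str],
    embed: Callable[[str], Vector],
) -> float:
    """Lyrics -> transliteration -> embedding -> cosine similarity."""
    vec_a = embed(transliterate(lyrics_a))
    vec_b = embed(transliterate(lyrics_b))
    return cosine_similarity(vec_a, vec_b)


def load_labse_embedder() -> Callable[[str], Vector]:
    """Assumption: LaBSE loaded through sentence-transformers (not confirmed by this post)."""
    from sentence_transformers import SentenceTransformer  # third-party, imported lazily

    model = SentenceTransformer("sentence-transformers/LaBSE")
    # encode() on a single string returns a 1-D array; convert it to a plain list
    return lambda text: model.encode(text).tolist()
```

Passing the two steps in as functions keeps the comparison logic independent of any particular transliteration or embedding library, so either can be swapped without touching the scoring code.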

The following is a summary of the similarity scores for the above approach:

Table 2: Summary of the top 20 songs of A R Rahman from the 90s along with their corresponding versions in Telugu and Tamil and the similarity scores obtained using the tool

It can be seen in Table 2 that the similarity scores taper off at a little over 80%. Based on these results, I’ve done a deep dive into the top 4 songs. Notably, they belong to movies that did only moderately well at the box office at best, which perhaps explains their half-hearted translation.
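
As a rough sketch of how such a table can be ordered and cut off, the snippet below sorts song pairs by score and flags those at or above a threshold. The 0.80 threshold is an assumption read off the "around 80%" observation, and the titles and scores in `example_scores` are placeholders, not values from Table 2.

```python
from typing import Dict, List, Tuple


def rank_by_similarity(
    scores: Dict[str, float], threshold: float = 0.80
) -> List[Tuple[str, float, bool]]:
    """Sort songs from most to least similar and flag those at or above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(title, score, score >= threshold) for title, score in ranked]


# Placeholder data for illustration only (not taken from Table 2)
example_scores = {"song_a": 0.91, "song_b": 0.62, "song_c": 0.83}

for title, score, flagged in rank_by_similarity(example_scores):
    label = "likely word-for-word" if flagged else "likely rewritten"
    print(f"{title}: {score:.2f} ({label})")
```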

Video 1: Song from “Amrutha” that scored highest in similarity to the analogue in Tamil

The first song in Table 2, shown in Video 1, turned out on further analysis to have very similar lyrics to its dubbed version. The sentences are also simple and three words long, making for easy word-for-word translation. Next in line is “O Maria”, which has many English words in both versions, and these contribute to the similarity.

On the other hand, two songs, “Pudhu Vellai Mazhai” and “Nenjinile Nenjinile (Jiya Jale)”, are shown to be dissimilar to their dubbed versions. These were famous hits, so it makes sense that dedicated versions were written for each language. Even the titles differ across languages for these songs, and the tool was expected to classify them as dissimilar, although the latter has a common Malayalam verse and was originally shot for Hindi.

Take It Easy

One outlier in the above results, which I had to get to the bottom of, was “Urvasi Urvasi”. I had thought the two versions were similar, but they showed more nuance on closer observation.

Video 2: Song from “Kadhalan” that is structurally and verbally similar to the Telugu version, but scores low presumably due to language differences

Surprisingly, it shows up much lower in the ranking. Although the songs are structurally the same in both languages, a few key phrases are language-specific and contribute to the difference.

Telugu: O cheli telusa telusa, Telugu maatalu padivelu
Tamil: Pesadi rathiye rathiye, Tamizhil vaarthaigal moondru laksham

The above snippet illustrates one of the subtle reasons why the two versions are very similar yet score low. The Telugu line talks about 10,000 words in Telugu, whereas the Tamil line talks about 300,000 words in Tamil. Other differences include:

Telugu: thindi dandagani nanna ante, take it easy policy
Tamil: thanda sorunnu appan sonna, take it easy policy

Although the above two phrases convey the same meaning, the tool does not recognise idioms. “தண்டச்சோறு” (thanda soru) refers to someone who is a waste of food. The idiom is common in colloquial language, but since there is no corresponding phrase in Telugu, the literal translation is used.
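
One way to locate such phrase-level differences is to score the lyrics line by line instead of as a whole. The sketch below is only an illustration and not part of the tool described here: it assumes the two versions are aligned line for line, and `embed` is a hypothetical callable that maps a line of text to a vector (for example, a LaBSE encoder).

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
LinePair = Tuple[str, str, float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def per_line_scores(
    lines_a: List[str], lines_b: List[str], embed: Callable[[str], Vector]
) -> List[LinePair]:
    """Pair lines by position and score each pair (extra unpaired lines are ignored)."""
    return [
        (line_a, line_b, cosine_similarity(embed(line_a), embed(line_b)))
        for line_a, line_b in zip(lines_a, lines_b)
    ]


def least_similar(pairs: List[LinePair], k: int = 3) -> List[LinePair]:
    """Return the k line pairs with the lowest similarity."""
    return sorted(pairs, key=lambda pair: pair[2])[:k]
```

The lowest-scoring pairs are the candidates to inspect by hand for language-specific phrases or idioms like the ones above.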

Summary

This article uses recent advances in Natural Language Processing (NLP) to find similar lyrics across languages in dubbed songs. A Python-based tool was developed that uses language-agnostic high-dimensional vector embeddings and deep-learning-based transliteration to bring the texts onto common ground and compare them. The famous works of A. R. Rahman in Telugu and Tamil were studied in particular, and songs that were very similar across languages were correctly identified by the tool. Some interesting observations were made on songs that were previously thought to be similar but were correctly identified as different due to nuances such as differences in idioms. Although the tool has not been shown to be foolproof and the similarity score may change depending on the lyrics source, no obvious counter-examples to its functionality were found in this work, and further discussion could go in that direction.

References

[1] Madhani, Y., Parthan, S., Bedekar, P., Khapra, R., Seshadri, V., Kunchukuttan, A., … & Khapra, M. M. (2022). Aksharantar: Towards building open transliteration tools for the next billion users. arXiv preprint arXiv:2205.03018.

[2] Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.

Acknowledgements

Many thanks to Prafulla Chandra A and Sankeerth Rao K for the discussions that helped bring this to fruition.

Code

The code that was used for this analysis is available here. The exact lyrics files used for the comparison have not been uploaded as they are subject to copyright.
