Calculation of TF-IDF disconnects from associated data/columns

Michelle_Han · May 12, 2022, 11:04am

Hello. When I calculate TF-IDF, the final result breaks up the connected columns, so the words are no longer associated with the right data. I am trying to determine the most important words for each article mentioned by readers, but the words that come up for a particular reading actually belong to a different reading. Why does this happen, and how can I fix it? I understand that we have to rejoin from a different data frame to recapture the entity name (https://blog.exploratory.io/demystifying-text-analytics-part-2-quantifying-documents-by-calculating-tf-idf-in-r-756955faa1ea), but I don't understand why the associations break up when executing TF-IDF. Thank you.

Hideaki_Hayashi · May 13, 2022, 2:53am

Hi Michelle,

Could you elaborate a little more on the broken association you are talking about?
If you are talking about associations between the document and the words in it, they are represented by the document ID column kept in the output.
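
To illustrate the idea of a document ID column carrying the association, here is a minimal Python sketch. It is not Exploratory's implementation; the function name `tf_idf_long`, the sample documents, and the `titles` lookup are all hypothetical, and the weighting is a plain textbook TF-IDF (term frequency times the natural log of inverse document frequency). Each output row keeps the ID of the document it came from, so other columns can be joined back on that ID.

```python
import math
from collections import Counter


def tf_idf_long(docs):
    """Return (doc_id, word, tf_idf) rows for a dict of doc_id -> text.

    Hypothetical helper: every row carries its document ID, which is
    what keeps each word tied to the document it came from.
    """
    # Word counts per document, keyed by document ID.
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    n_docs = len(counts)

    # Number of documents in which each word appears.
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    rows = []
    for doc_id, word_counts in counts.items():
        total_words = sum(word_counts.values())
        for word, n in word_counts.items():
            tf = n / total_words
            idf = math.log(n_docs / doc_freq[word])
            rows.append((doc_id, word, tf * idf))
    return rows


# Made-up sample data (an assumption for illustration only).
docs = {
    "article_a": "the cat sat on the mat",
    "article_b": "the dog chased the ball",
}
titles = {"article_a": "Cats at Home", "article_b": "Dogs at Play"}

rows = tf_idf_long(docs)

# Pick the highest-scoring word per document, then join the title back by ID.
top = {}
for doc_id, word, score in rows:
    if doc_id not in top or score > top[doc_id][1]:
        top[doc_id] = (word, score)

for doc_id, (word, score) in top.items():
    print(titles[doc_id], word, round(score, 3))
```

If the grouping column chosen for the calculation is not the one that uniquely identifies each document, words can end up grouped under the wrong document, which may be what the misalignment looks like.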

Michelle_Han · May 13, 2022, 4:52am

Thank you. I changed the column selection to document instead of ID, and the data seems to align now. In my mind the two were equivalent, but they are not the same when executing TF-IDF.
Thank you so much!