Exploratory Community

How to parse word combos?

#1

Hi, hope all is well.

I’m trying to analyze domain names. Does anyone know how to parse a name with 2 or 3 English dictionary words using Exploratory?

For example,

  1. AirTable = Air, Table
  2. SeoNinja = Seo, Ninja
  3. BigFatDollar = Big, Fat, Dollar

Appreciate your feedback, thanks.

#2

Hi @CC_LEE

I am not sure whether such names can be split completely into ‘dictionary words’, but if the strings follow a specific pattern like the one in your example, I think it can be done to some extent.

In the example you showed, each word starts with an uppercase letter. If this pattern is present, you can insert a comma and a space before every uppercase letter except the first one, using the following regular expression.

  • col option > Work With Text Data > Replace > Text(All)
From: (?<!^)(?=[A-Z])
To: , 
※ Note: there is a space after the comma.
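
For reference, the same replacement can be tried in plain R outside the menu. This is only a minimal sketch, assuming that base R's gsub with perl = TRUE treats this pattern the same way as the Replace step; split_camel is a hypothetical helper name, not an Exploratory function.

```r
# Hypothetical helper: insert ", " before every uppercase letter
# except one at the very start of the string.
split_camel <- function(x) {
  gsub("(?<!^)(?=[A-Z])", ", ", x, perl = TRUE)
}

domain_names <- c("AirTable", "SeoNinja", "BigFatDollar")
split_names <- split_camel(domain_names)
print(split_names)

# To get the words as separate elements, split on the inserted separator.
word_list <- strsplit(split_names, ", ", fixed = TRUE)
print(word_list)
```

This only works when the capital letters mark the word boundaries; all-lowercase names are left unchanged.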

I hope this helps.

#3

I will give it a try, many thanks.

There is no specific pattern other than the extension (.com), please see a sample below.
apartmentlocators.com
infrastar.com
tekh.com
moneymarketplace.com
easybtc.com
thelens.com
mobilepixels.com
collaborativeclassroom.com
quickgym.com
behard.com
emojify.com
menuease.com
btoo.com
bizcashadvance.com
haroldsauto.com
aircraftsale.com
aurayoga.com
redeemerbaptist.com

#4

Hi @CC_LEE

“There is no specific pattern other than the extension (.com), please see a sample below.”
If there is no specific pattern, then it is difficult to split the strings with a fixed rule.

In this situation, I think it would be better to use a string search algorithm (KMP, BM, and many others). You give it the keywords you want to search for, and it tells you whether a string contains those keywords or not.

As far as I know, Exploratory doesn’t provide such a feature by default, so you need to either write the algorithm from scratch or write an R script by combining R packages.
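
As one illustration of what "writing the algorithm from scratch" might look like, here is a rough base R sketch of a dictionary-based split using dynamic programming (often called the word-break problem). It is a different technique from the string search algorithms named above, and it is not an Exploratory feature. The names segment_words and toy_dictionary are hypothetical, and the tiny word list is an assumption made only for the demo; real use would need a large dictionary.

```r
# Hypothetical helper: split `name` into words from `dictionary`.
# Returns NULL when the whole name cannot be covered by dictionary words.
segment_words <- function(name, dictionary) {
  n <- nchar(name)
  # best[[i + 1]] holds one split of the first i characters, or NULL if none.
  best <- vector("list", n + 1)
  best[[1]] <- character(0)
  for (end in seq_len(n)) {
    for (start in seq_len(end)) {
      prefix <- best[[start]]
      piece <- substr(name, start, end)
      if (!is.null(prefix) && piece %in% dictionary) {
        best[[end + 1]] <- c(prefix, piece)
        break
      }
    }
  }
  best[[n + 1]]
}

# Toy word list, assumed for illustration only.
toy_dictionary <- c("money", "market", "place", "biz", "cash", "advance")

domains <- c("moneymarketplace.com", "bizcashadvance.com", "tekh.com")
names_only <- sub("\\.com$", "", domains)  # drop the extension first

segments <- lapply(names_only, segment_words, dictionary = toy_dictionary)
names(segments) <- names_only
print(segments)
```

Names that cannot be fully covered by the word list (such as tekh here) come back as NULL, so the quality of the result depends entirely on the dictionary used.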

After some research (I’m sure there are many more if you look for them), I found an R package that implements the Aho-Corasick method. By combining it with other R packages, you can write a function that outputs the search results in a format that is acceptable to Exploratory.

library(AhoCorasickTrie)

# Strings to search (domain names without the extension).
target_text <- c("bizcashadvance", "moneymarketplace", "apartmentlocators", "teststring")

# Normally, you would not specify the words by hand, but use dictionary data where a huge number of words are stored.
target_words <- c("money", "market", "place", "cash", "advance", "apple", "apartment")

# For each string, list the keywords found and the position where each one starts.
AhoCorasickSearch(keywords = target_words, text = target_text)

[[1]]
[[1]][[1]]
[[1]][[1]]$Keyword
[1] "cash"

[[1]][[1]]$Offset
[1] 4

[[1]][[2]]
[[1]][[2]]$Keyword
[1] "advance"

[[1]][[2]]$Offset
[1] 8

[[2]]
[[2]][[1]]
[[2]][[1]]$Keyword
[1] "money"

[[2]][[1]]$Offset
[1] 1

[[2]][[2]]
[[2]][[2]]$Keyword
[1] "market"

[[2]][[2]]$Offset
[1] 6

[[2]][[3]]
[[2]][[3]]$Keyword
[1] "place"

[[2]][[3]]$Offset
[1] 12

[[3]]
[[3]][[1]]
[[3]][[1]]$Keyword
[1] "apartment"

[[3]][[1]]$Offset
[1] 1

[[4]]
list()
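
To turn a nested list like the one above into a table, something along these lines might work. This is a sketch under assumptions: search_result is a hand-built copy of the printed output (so it runs without the package installed), the real return value may carry extra names or a slightly different structure, and matches_to_df is a hypothetical helper name.

```r
# Hand-built copy of the result printed above (assumed structure).
target_text <- c("bizcashadvance", "moneymarketplace", "apartmentlocators", "teststring")
search_result <- list(
  list(list(Keyword = "cash", Offset = 4L), list(Keyword = "advance", Offset = 8L)),
  list(list(Keyword = "money", Offset = 1L), list(Keyword = "market", Offset = 6L),
       list(Keyword = "place", Offset = 12L)),
  list(list(Keyword = "apartment", Offset = 1L)),
  list()
)

# Hypothetical helper: one row per (text, keyword) match.
matches_to_df <- function(texts, result) {
  rows <- lapply(seq_along(result), function(i) {
    hits <- result[[i]]
    data.frame(
      text = rep(texts[i], length(hits)),
      keyword = vapply(hits, function(h) h$Keyword, character(1)),
      offset = vapply(hits, function(h) as.integer(h$Offset), integer(1)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

match_table <- matches_to_df(target_text, search_result)
print(match_table)
```

Strings with no match (teststring here) simply contribute no rows, so they would need to be joined back in if you want to keep them.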

I hope this helps.

#5

Thanks, Sugiaki. This is way too difficult for me, but I truly appreciate your feedback.