I have one column called PMID which is a unique identifier for scientific publications. I want to write an advanced filter in which I could ask Exploratory to give me all PMIDs where the Authors column is equal to “G* Perry” (using a wildcard after the first initial). This would bring back a list of about 40 PMIDs. I would then like to get a de-duplicated list of the names of all the coauthors of those 40 PMID (because some of the coauthors will have been on more than one of the 40 publications). This will return about 400 unique strings from the Authors column . Finally I would like to filter the data frame to contain all only those rows where the Authors column was equal to one of the 400 unique strings (probably giving me something like 2000 unique PMIDs). Thanks for any guidance you can give me.
Hi Aaron,
Here’s what you can do.
- Create a branch where you filter the data and de-duplicate.
The regular expression you want to use is something like below.
G.* Perry
and you can find ‘distinct’ command to ‘de-duplicate’ from here.
- Use Semi Join to filter the main data by the branch data.
You can see how you can filter the data using Semi Join here.
Let me know how it goes.
Cheers,
Kan
@Kan_Nishida,
Thanks a lot. The combination of the semi-join with the realization that various intermediate work products can be saved and then re-used via branching (in the end I created three separate branches) was the breakthrough I needed!
Aaron