Exploratory Community

How to make groups by using information from two columns

#1

Here’s a question from our user.


I have a list of 50,000 physicians that are not labeled by what year they are in training. The physicians are listed in consecutive order by a numerical userid number.

Current Data:
Name Userid number City/State
Physician Name A - 1 - New York, NY
Physician Name B - 2 - New York, NY
Physician Name C - 3 - Washington, DC

To break out by the groups I need people who have consecutive doctor id numbers and have the same city state.

Desired Output:
Name Userid number City/State Group
Physician Name A - 1 - New York, NY Group x (First person)
Physician Name B - 2 - New York, NY Group x (Userid is plus one more than the last and same city with New York, NY)
Physician Name C - 3 - Washington, DC Group y (Userid is plus one more than the last BUT not the same city as New York, NY)


#2

You can use Arrange step to sort the data and show the data the way you like. You want to select City/State column first, then select Userid column.

If you want to create a Group column, you can simply use City/State column as the group name since that’s the only condition for separating the group.

If you want to give an id to the group name, then you can use ‘group_indices’ function. You want to ‘group’ the data frame first by CityState column, then use Mutate step to type the function like the below.

Group

Mutate

This will populate Group Index column with unique group id.