merge command comparison between R and Stata

Asked 7/9, 2011 at 8:1 Answered 5/6, 2015 at 0:13

As an R user, I am now learning Stata using this resource, and I am puzzled by the merge command.

In R, I don't have to worry about duplicates in the common columns: each matching row of the Y data frame is merged onto every duplicated row of the X data frame (using all=FALSE in merge).

In Stata, however, it seems I need to remove the duplicate rows from X before proceeding with the merge.

Does Stata assume that the common column in the master table must be unique for merge to proceed?

Throttle answered 7/9, 2011 at 8:1 Comment(2)

For merging issues in Stata, I find mmerge really useful. – Byers 7/9, 2011 at 18:50

FYI: Starting with Stata 11, the features of mmerge have been incorporated in the "official" merge command. – Scotism 8/9, 2011 at 8:12

The answer to your question is No. I will try to explain why.

The link you mention covers only one type of merge that is possible with Stata, namely the one-to-many merge.

merge 1:m varlist using filename

Other types of merge are possible:

One-to-one merge on specified key variables

merge 1:1 varlist using filename

Many-to-one merge on specified key variables

merge m:1 varlist using filename

Many-to-many merge on specified key variables

merge m:m varlist using filename

One-to-one merge by observation

merge 1:1 _n using filename

Details, explanations and examples can be found in help merge.
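As a minimal sketch of why the master key does not have to be unique, the following builds two small, made-up datasets (the variables id, x, y and the tempfile names are purely illustrative) and performs a many-to-one merge:

```stata
* Hypothetical master data: id is repeated
clear
input id x
1 10
1 11
2 20
end
tempfile master
save `master'

* Hypothetical using data: one row per id
clear
input id y
1 100
2 200
3 300
end
tempfile using_data
save `using_data'

use `master', clear
* m:1 = many rows per id in the master, one row per id in the using file
merge m:1 id using `using_data'
* _merge records whether a row came from the master, the using file, or both
tabulate _merge
* keep matched rows only, comparable to all=FALSE in R
keep if _merge == 3
```

Both master rows with id 1 receive the same y value from the using file, so no duplicates need to be dropped beforehand. If the using file also had repeated ids, merge m:1 would stop with an error instead of merging silently.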

If you do not know if observations are unique in a dataset, you can do the following check:

bysort idvar: gen N = _N

ta N

If you find values of N that are greater than 1, you know that observations are not unique with respect to idvar.
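The same check can also be sketched with the built-in commands isid and duplicates report. The example below uses the auto dataset that ships with Stata, purely for illustration; substitute your own data and key variable:

```stata
sysuse auto, clear
* make identifies each car uniquely, so isid returns silently
isid make
* foreign takes repeated values, so isid would stop with an error;
* capture suppresses the error and stores the return code in _rc
capture isid foreign
if _rc != 0 {
    display "foreign does not uniquely identify the observations"
    * show how many observations share each number of copies
    duplicates report foreign
}
```

Unlike the bysort approach, this does not add a helper variable to the dataset.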

The 1:1, 1:m, m:1 and m:m forms shown above are in fact the new syntax of the merge command, introduced with Stata 11. Before Stata 11, the merge command was a bit simpler. You simply had to sort your data, and then you could do:

merge varlist using filename

By the way, you can still use this old syntax in Stata 11 or higher.

Scotism answered 7/9, 2011 at 9:7 Comment(4)

Good job with a pretty comprehensive answer. Note that the older syntax was simpler but they changed it because it caused all sorts of hard-to-detect problems when your data was not as expected. Using the old syntax still works but returns a warning. – Commentate 7/9, 2011 at 10:9

@gsk3: Good comment. Personally, it took me some time to adopt the new merge syntax in my programs and class notes. At first glance, the new syntax looks and feels more complicated. However, it pays off quickly because it can draw your attention to problems in the data. – Scotism 7/9, 2011 at 10:53

It's a credit to Stata that they did something to make their language more difficult to understand at first but better in the long run. Particularly since most of their customers don't come from programming backgrounds and therefore are unlikely to immediately get how much things like this (perhaps most akin to strong typing) improve their lives :-) – Commentate 7/9, 2011 at 12:7

The command isid offers an easier way to test whether a variable is a unique identifier. – Ibadan 30/6, 2013 at 12:30

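joinby is the command that corresponds to the R command merge. By default it keeps only matched observations, like all=FALSE in R; joinby, unmatched(both) also keeps unmatched observations from both datasets, like all=TRUE.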

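In particular, merge m:m DOES NOT do a many-to-many merge (i.e. a full join), contrary to what the documentation implies. Within each key group it pairs observations by position rather than forming all pairwise combinations.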
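A small sketch of the difference, using two made-up datasets in which the key id is duplicated on both sides (all variable and tempfile names are illustrative):

```stata
* Hypothetical left-hand data: two rows with id 1
clear
input id x
1 1
1 2
end
tempfile left
save `left'

* Hypothetical right-hand data: two rows with id 1
clear
input id y
1 10
1 20
end
tempfile right
save `right'

* merge m:m pairs rows by position within each id group
use `left', clear
merge m:m id using `right'
list id x y

* joinby forms every pairwise combination within each id group
use `left', clear
joinby id using `right'
list id x y
```

Here merge m:m leaves two observations, while joinby produces four, which is what merge in R returns for the same inputs.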

Bowrah answered 5/6, 2015 at 0:13 Comment(0)
