How to merge two data frame based on partial string match with R?
Asked Answered
N

2

1

I have two data frames:

the first one contains a huge number of proteins for which I have made several calculations. here an example:

>Accession  Description # Peptides A2   # PSM A2    # Peptides B2   # PSM B2    # Peptides C2   # PSM C2    # Peptides D2   # PSM D2    # Peptides E2   # PSM E2    # AAs   MW [kDa]    calc. pI
P01837  Ig kappa chain C region OS=Mus musculus PE=1 SV=1 - [IGKC_MOUSE]    10  319 8   128 8   116 7   114         106 11,8    5,41
P01868  Ig gamma-1 chain C region secreted form OS=Mus musculus GN=Ighg1 PE=1 SV=1 - [IGHG1_MOUSE]  13  251 15  122 16  116 16  108         324 35,7    7,40
P60710  Actin, cytoplasmic 1 OS=Mus musculus GN=Actb PE=1 SV=1 - [ACTB_MOUSE]   15  215 10  37  11  30  11  31  16  154 375 41,7    5,48

the second contains the proteins of interest. here an example:

>complex    Description Accession   protein
TFIID   [TAF1_MOUSE]    Q80UV9-3    Isoform 3 of Transcription initiation factor TFIID subunit 1 OS=Mus musculus GN=Taf1 - [TAF1_MOUSE]
TFIID   [TAF2_MOUSE]    Q8C176  Transcription initiation factor TFIID subunit 2 OS=Mus musculus GN=Taf2 PE=2 SV=2 - [TAF2_MOUSE]
TFIID   [TAF3_MOUSE]    Q5HZG4  Transcription initiation factor TFIID subunit 3 OS=Mus musculus GN=Taf3 PE=1 SV=2 - [TAF3_MOUSE]

What I want to do: get one data frame containing the values from my calculations for the proteins of interest only. In a first attempt I used:

fusion <- merge.data.frame(x=tableaucleanIPTAFXwoNA, y=sublist, by.x="Description", by.y="protein", all =FALSE)

However, the nomenclature of the protein names are different between the two data frames and using the merge function this does not work.

So, how could I perform a partial match for "TAF10" when it is part of "Transcription initiation factor TFIID subunit 10 OS=Mus musculus GN=Taf10 PE=1 SV=1 - [TAF10_MOUSE]" string text ? In other words I want R recognizes only a piece o f the whole string.

I tried to use grep function:

idx2 <- sapply("tableaucleanIPTAFX$Description", grep, "sublist$Description")  

However, I got that:

as.data.frame(idx2)
[1] tableaucleanIPTAFX.Description
<0 rows> (or 0-length row.names)

I guess that, the pattern is not correctly recognized... Then I visited the RegExr website to write a regular expression so that my id names can be recognized. I found that this works to recognize [TRRAP_MOUSE] into

Transformation/transcription domain-associated protein OS=Mus musculus GN=Trrap PE=1 SV=2 - [TRRAP_MOUSE] :

with

 /(TRRAP_[MOUSE])\w+/g

I wonder how I can implement it to my id list (the "Description" column in my example) ?

Northway answered 11/1, 2016 at 11:31 Comment(1)
please provide some sample data.Independent
B
8

This might work for you and it handles duplicates:

First some dummy data:

df1 <- data.frame(name=c("George", "Abraham", "Barack"), stringsAsFactors = F)
df2 <- data.frame(president=c("Thanks, Obama (Barack)","Lincoln, Abraham, George""George Washington"), stringsAsFactors = F)

Find the code in the full description using grep:

idx2 <- sapply(df1$name, grep, df2$president)

This can result in multiple matches if multiple descriptions match the code so here I duplicate the original indices so the results align:

idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]])))

"merge" the datasets with cbind aligned on the new indices:

> cbind(df1[unlist(idx1),,drop=F], df2[unlist(idx2),,drop=F])
       name                president
1    George Lincoln, Abraham, George
1.1  George        George Washington
2   Abraham Lincoln, Abraham, George
3    Barack   Thanks, Obama (Barack)
Bissell answered 11/1, 2016 at 14:7 Comment(5)
Thank you @Zelazny7, this seems a good idea. However, when I tried to run grep into sapply() function, I got that : Error in FUN(X[[i]], ...) : invalid regular expression 'Superoxide dismutase [Cu-Zn] OS=Mus musculus GN=Sod1 PE=1 SV=2 - [SODC_MOUSE]', reason 'Invalid character range'. It seems that some characters are not correctly recognized.Northway
You have to escape the brackets like so: '\[Cu-Zn\]'Bissell
I have found this as a solution to a bit similar problem I have, but I find one issue. For some reason, when a value from df1 partially matches more than one value from df2, in the result table there is only 1 of these matches listed (seems to be random which one), while at the end of the table there are same number of extra rows as number of rows in df1, but instead of values for df1 they have only NAs and values from df2.Increase
For the above solution, how can I keep only those rows where the string match is maximum? I would like to drop duplicate rows with multiple matchesSara
If the final result is stored in a variable res: do.call(rbind, by(res, res$name, function(chunk) { i <- which.min(adist(chunk$name[[1]], chunk$president)) chunk[i,] }))Bissell
B
1

(Your question is a bit vague - it would be better with some sample/foobar data - so this answer unfortunately is too)

Try this:

?grep                                       # Pattern Matching and Replacement
X <- data.frame(a = letters[1:10])
grep(pattern = "c", x = X$a)                # returns position of "c": 3
grepl(pattern = "c", x = X$a)               # returns a vector of bools: [ F F T F F ... ]
X[grepl(pattern = "c", x = X$a),"a") <- "C" # replaces "c" with "C"

PS:

  • depending on how big / dirty your element names lists are, I've often found it useful to (i) create a clean (short and unambiguous) dictionary of names, (ii) add a new column with this new name to each original list and (iii) perform the merge with these columns;
  • aside from base::merge, I like to use dplyr's join functions (mostly because I like their cheat sheet);
Balbriggan answered 11/1, 2016 at 12:15 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.