This markdown presents several ways to manipulate strings in R.

Standard functions

paste("Today is ", date())
## [1] "Today is  Mon May 30 15:03:14 2016"
xs <- 1:7
paste0("A", xs)  
## [1] "A1" "A2" "A3" "A4" "A5" "A6" "A7"
paste("A", xs, sep=",")
## [1] "A,1" "A,2" "A,3" "A,4" "A,5" "A,6" "A,7"
paste(letters[1:10],xs,sep="|")
##  [1] "a|1" "b|2" "c|3" "d|4" "e|5" "f|6" "g|7" "h|1" "i|2" "j|3"
paste(letters[1:10],xs,sep="|",collapse=",")
## [1] "a|1,b|2,c|3,d|4,e|5,f|6,g|7,h|1,i|2,j|3"
cs <- "o mapa nao e o territorio"
paste0("'", cs, "' tem ", nchar(cs), " caracteres")
## [1] "'o mapa nao e o territorio' tem 25 caracteres"
substr(cs,3,6)
## [1] "mapa"
substr(cs,3,6) <- "MAPA"
cs
## [1] "o MAPA nao e o territorio"
substring(cs, 2, 4:6)
## [1] " MA"   " MAP"  " MAPA"
xs <- c("ontem", "hoje", "amanha", "depois de amanha")
substring(xs, 2) <- c("XX", "YY", "Z")
xs
## [1] "oXXem"            "hYYe"             "aZanha"          
## [4] "dXXois de amanha"
cs <- "o mapa nao e o territorio"
strsplit(cs,"[oa]")
## [[1]]
## [1] ""        " m"      "p"       " n"      ""        " e "     " territ"
## [8] "ri"
cs <- paste(letters[1:10],1:7,sep="|",collapse=",")
cs
## [1] "a|1,b|2,c|3,d|4,e|5,f|6,g|7,h|1,i|2,j|3"
cs1 <- strsplit(cs,"[,|]")[[1]]
cs1
##  [1] "a" "1" "b" "2" "c" "3" "d" "4" "e" "5" "f" "6" "g" "7" "h" "1" "i"
## [18] "2" "j" "3"
cs1 <- paste0(cs1,collapse="")
cs1
## [1] "a1b2c3d4e5f6g7h1i2j3"

Regular Expressions

strsplit(cs1,"[1-9]")   # use every digit as a separator
## [[1]]
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
strsplit("a.b.c",".")   # . is the operator that accepts all as separator
## [[1]]
## [1] "" "" "" "" ""
strsplit("a.b.c","\\.") # separates by the point
## [[1]]
## [1] "a" "b" "c"
cs <- c("aaa","abb","ccc","dda","eaa")
sub("a","X",cs)  # sub replaces the first match for the entries of a vector
## [1] "Xaa" "Xbb" "ccc" "ddX" "eXa"
gsub("a","X",cs) # the same but replaces all matches
## [1] "XXX" "Xbb" "ccc" "ddX" "eXX"
text.test <- "Evidence for a model (or belief) must be considered against alternative models. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you that the next dice throw will be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there's an alternative model ('just a lucky guess') that also explains it and it's much more likely to be the right model (because ESP needs much more assumptions, many of those in conflict with accepted facts and theories). This is a subject of statistical inference. It's crucial to consider the alternatives when we want to put our beliefs to the test."
gsub("belief|model","XXX",text.test) # erase every word equals to belief *or* model
## [1] "Evidence for a XXX (or XXX) must be considered against alternative XXXs. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you that the next dice throw will be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there's an alternative XXX ('just a lucky guess') that also explains it and it's much more likely to be the right XXX (because ESP needs much more assumptions, many of those in conflict with accepted facts and theories). This is a subject of statistical inference. It's crucial to consider the alternatives when we want to put our XXXs to the test."
gsub("t([a-z]*)?t","XXX",text.test) # erase every 0+ letters between 2 t's
## [1] "Evidence for a model (or belief) must be considered against alXXXive models. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you XXX the next dice throw will be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there's an alXXXive model ('just a lucky guess') XXX also explains it and it's much more likely to be the right model (because ESP needs much more assumptions, many of those in conflict with accepted facts and theories). This is a subject of sXXXical inference. It's crucial to consider the alXXXives when we want to put our beliefs to the XXX."
gsub("([a-z])\\1","YY",text.test)   # erase every letter repeated twice (eg: 'ee', 'll')
## [1] "Evidence for a model (or belief) must be considered against alternative models. Let me describe a neutral (and very simple) example: AYYume I say I have Extra Sensorial Perception (ESP) and teYY you that the next dice throw wiYY be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there's an alternative model ('just a lucky gueYY') that also explains it and it's much more likely to be the right model (because ESP nYYds much more aYYumptions, many of those in conflict with aYYepted facts and theories). This is a subject of statistical inference. It's crucial to consider the alternatives when we want to put our beliefs to the test."
gsub("(model)","*\\1*",text.test)   # bold every 'model' word
## [1] "Evidence for a *model* (or belief) must be considered against alternative *model*s. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you that the next dice throw will be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there's an alternative *model* ('just a lucky guess') that also explains it and it's much more likely to be the right *model* (because ESP needs much more assumptions, many of those in conflict with accepted facts and theories). This is a subject of statistical inference. It's crucial to consider the alternatives when we want to put our beliefs to the test."
gsub("(t)(h)","\\2\\1", text.test)  # swap every 'th' to 'ht'
## [1] "Evidence for a model (or belief) must be considered against alternative models. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you htat hte next dice htrow will be 1. You htrow hte dice and I was right. That is evidence for my claim of ESP. However htere's an alternative model ('just a lucky guess') htat also explains it and it's much more likely to be hte right model (because ESP needs much more assumptions, many of htose in conflict wiht accepted facts and hteories). This is a subject of statistical inference. It's crucial to consider hte alternatives when we want to put our beliefs to hte test."
gsub("([^a-zA-Z])([aA][a-z]+)","\\1*\\2*",text.test) # bold every word that begins with 'a'
## [1] "Evidence for a model (or belief) must be considered *against* *alternative* models. Let me describe a neutral (*and* very simple) example: *Assume* I say I have Extra Sensorial Perception (ESP) *and* tell you that the next dice throw will be 1. You throw the dice *and* I was right. That is evidence for my claim of ESP. However there's *an* *alternative* model ('just a lucky guess') that *also* explains it *and* it's much more likely to be the right model (because ESP needs much more *assumptions*, many of those in conflict with *accepted* facts *and* theories). This is a subject of statistical inference. It's crucial to consider the *alternatives* when we want to put our beliefs to the test."
gsub("([^a-zA-Z])([a-z]){1,3}([^a-zA-Z])","\\1ZZZ\\3",text.test)   # erase every word with 1 to 3 letters 
## [1] "Evidence ZZZ a model (ZZZ belief) must ZZZ considered against alternative models. Let ZZZ describe ZZZ neutral (ZZZ very simple) example: Assume I ZZZ I have Extra Sensorial Perception (ESP) ZZZ tell ZZZ that ZZZ next dice throw will ZZZ 1. You throw ZZZ dice ZZZ I ZZZ right. That ZZZ evidence ZZZ my claim ZZZ ESP. However there'ZZZ an alternative model ('just ZZZ lucky guess') that also explains ZZZ and ZZZ's much more likely ZZZ be ZZZ right model (because ESP needs much more assumptions, many ZZZ those ZZZ conflict with accepted facts ZZZ theories). This ZZZ a subject ZZZ statistical inference. It'ZZZ crucial ZZZ consider ZZZ alternatives when ZZZ want ZZZ put ZZZ beliefs ZZZ the test."
# {3} means exactly 3 and {3,} means 3 or more
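# for instance:
grepl("^[a-z]{3}$",  c("ab","abc","abcd"))  # FALSE  TRUE FALSE
grepl("^[a-z]{3,}$", c("ab","abc","abcd"))  # FALSE  TRUE  TRUE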
separators <- "[,.: ()']"
tokens <- strsplit(text.test, separators)[[1]]  # tokenize text into words
tokens <- tokens[tokens != ""]                  # remove empty tokens
tokens
##   [1] "Evidence"     "for"          "a"            "model"       
##   [5] "or"           "belief"       "must"         "be"          
##   [9] "considered"   "against"      "alternative"  "models"      
##  [13] "Let"          "me"           "describe"     "a"           
##  [17] "neutral"      "and"          "very"         "simple"      
##  [21] "example"      "Assume"       "I"            "say"         
##  [25] "I"            "have"         "Extra"        "Sensorial"   
##  [29] "Perception"   "ESP"          "and"          "tell"        
##  [33] "you"          "that"         "the"          "next"        
##  [37] "dice"         "throw"        "will"         "be"          
##  [41] "1"            "You"          "throw"        "the"         
##  [45] "dice"         "and"          "I"            "was"         
##  [49] "right"        "That"         "is"           "evidence"    
##  [53] "for"          "my"           "claim"        "of"          
##  [57] "ESP"          "However"      "there"        "s"           
##  [61] "an"           "alternative"  "model"        "just"        
##  [65] "a"            "lucky"        "guess"        "that"        
##  [69] "also"         "explains"     "it"           "and"         
##  [73] "it"           "s"            "much"         "more"        
##  [77] "likely"       "to"           "be"           "the"         
##  [81] "right"        "model"        "because"      "ESP"         
##  [85] "needs"        "much"         "more"         "assumptions" 
##  [89] "many"         "of"           "those"        "in"          
##  [93] "conflict"     "with"         "accepted"     "facts"       
##  [97] "and"          "theories"     "This"         "is"          
## [101] "a"            "subject"      "of"           "statistical" 
## [105] "inference"    "It"           "s"            "crucial"     
## [109] "to"           "consider"     "the"          "alternatives"
## [113] "when"         "we"           "want"         "to"          
## [117] "put"          "our"          "beliefs"      "to"          
## [121] "the"          "test"
grep("dice", tokens, fixed=TRUE)                # where are 'dice' tokens (returns indexes)
## [1] 37 45
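# with value=TRUE, grep() returns the matching tokens instead of their indexes:
grep("dice", tokens, fixed=TRUE, value=TRUE)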
string <- "abcedabcfaa"
cs <- strsplit(gsub("([a-z])","\\1,",string),",")[[1]] # convert to vector of chars
grepl("a",cs)   # TRUE if there's a match, FALSE otherwise
##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
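# note: a simpler idiom for splitting a string into single characters is the
# empty pattern:
strsplit(string, "")[[1]]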
# regexpr() gives the position of the first match in each element of the
# vector (-1 if not found); the match.length attribute holds its length
cs <- c("aaa", "axx", "xaa", "axx", "xxx", "xxx")
regexpr("a", cs)
## [1]  1  1  2  1 -1 -1
## attr(,"match.length")
## [1]  1  1  1  1 -1 -1
## attr(,"useBytes")
## [1] TRUE
regexpr("a*", cs)
## [1] 1 1 1 1 1 1
## attr(,"match.length")
## [1] 3 1 0 1 0 0
## attr(,"useBytes")
## [1] TRUE
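# note: "a*" can match the empty string, so elements without any 'a' still
# report a (zero-length) match at position 1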
# regexec() gives the starting index of the whole match and of each parenthesized sub-expression
cs <- c("123ab67","ab321","10000","0","abc")
regexec("[a-z]*([0-9]+)",cs)
## [[1]]
## [1] 1 1
## attr(,"match.length")
## [1] 3 3
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] 1 3
## attr(,"match.length")
## [1] 5 3
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] 1 1
## attr(,"match.length")
## [1] 5 5
## attr(,"useBytes")
## [1] TRUE
## 
## [[4]]
## [1] 1 1
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[5]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
# A more complex example:
set.seed(101)
pop.data <- paste("the population is", floor(runif(20,1e3,5e4)),"birds")
head(pop.data)
## [1] "the population is 19237 birds" "the population is 3147 birds" 
## [3] "the population is 35774 birds" "the population is 33226 birds"
## [5] "the population is 13242 birds" "the population is 15702 birds"
reg.info <- regexec("the population is ([0-9]*) birds", pop.data)
reg.info[1:3]
## [[1]]
## [1]  1 19
## attr(,"match.length")
## [1] 29  5
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1]  1 19
## attr(,"match.length")
## [1] 28  4
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  1 19
## attr(,"match.length")
## [1] 29  5
## attr(,"useBytes")
## [1] TRUE
reg.data <- regmatches(pop.data, reg.info)
reg.data[1:3]
## [[1]]
## [1] "the population is 19237 birds" "19237"                        
## 
## [[2]]
## [1] "the population is 3147 birds" "3147"                        
## 
## [[3]]
## [1] "the population is 35774 birds" "35774"
bird.population <- sapply(reg.data, function(x) x[2])  # second element = first capture group
bird.population
##  [1] "19237" "3147"  "35774" "33226" "13242" "15702" "29658" "17339"
##  [9] "31478" "27745" "44109" "35636" "36866" "46650" "23300" "29925"
## [17] "41201" "11981" "21171" "2891"

One more example (based on this one). It shows how to draw grey rectangles over the depressions in a time series:

set.seed(1303)
# make time-series
steps <- sample(-2:2, size=200, prob=c(.1,.2,.2,.4,.1), replace=TRUE)
ts <- cumsum(steps)        # a random walk (nb: this 'ts' masks the base ts() here)
plot(ts, type="l")

# assume we didn't know how ts was made
difs <- sign(diff(ts)>=0)  # 0 where the series decreased, 1 otherwise

bits <- paste0(difs,collapse="")  # collapse into a string of bits
bits
## [1] "1010101100111011011111100111111111111011100000000111110111001101101110111111001010111101100101111001100111011101101111100011011100111110100111111000011010000001010111111101111101111011111110100000011"
# let's signal a consecutive decrease over 2+ time steps (i.e., a depression)
matches <- gregexpr("00+", bits, perl = TRUE)[[1]]
matches
##  [1]   9  24  42  59  77  90  98 102 120 129 138 146 154 192
## attr(,"match.length")
##  [1] 2 2 8 2 2 2 2 2 3 2 2 4 6 6
## attr(,"useBytes")
## [1] TRUE
attributes(matches)$match.length # the length of each match (equivalently, attr(matches, "match.length"))
##  [1] 2 2 8 2 2 2 2 2 3 2 2 4 6 6
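# regmatches() pairs with gregexpr() to extract the matched substrings
# themselves (here, the actual runs of 0s):
regmatches(bits, gregexpr("00+", bits, perl=TRUE))[[1]]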
# let's plot the time series with the depressions marked as grey rectangles
plot(ts, type="n")
min.y <- rep(min(ts),length(matches))
max.y <- rep(max(ts),length(matches))
rect(matches, min.y, matches+attributes(matches)$match.length, max.y, col="lightgrey", border=NA)
points(ts, type="l")

Package stringr

[R Help] stringr is a set of simple wrappers that make R's string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent; all functions deal with NAs and zero-length character vectors appropriately; and the output data structure from each function matches the input data structures of other functions.

library(stringr)

str1 <- c("o mapa")
str2 <- c("nao e o territorio")
str3 <- str_c(str1,str2, sep=" ")  # join 2+ strings
str3
## [1] "o mapa nao e o territorio"
str_c(letters, collapse = ", ")
## [1] "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z"
str_length(str3)
## [1] 25
str_dup("ab",5)                    # duplicates strings
## [1] "ababababab"
str_dup(c("ab","c"),3)
## [1] "ababab" "ccc"
str_dup("ab",1:3)
## [1] "ab"     "abab"   "ababab"
str_count(str3, "r")               # the number of matches
## [1] 3
str_detect(str3, "r")              # verifies if match exists
## [1] TRUE
str_extract(str3, "[it][eo]+")     # extract first match
## [1] "te"
str_extract_all(str3, "[it][eo]+") # extract all matches
## [[1]]
## [1] "te" "to" "io"
str_locate(str3, "[it][eo]+")      # locate the first match
##      start end
## [1,]    16  17
str_locate_all(str3, "[it][eo]+")  # locate all matches
## [[1]]
##      start end
## [1,]    16  17
## [2,]    21  22
## [3,]    24  25
str_replace(str3,"r","R")          # replace first match
## [1] "o mapa nao e o teRritorio"
str_replace_all(str3,"r","R")      # replace all matches
## [1] "o mapa nao e o teRRitoRio"
str_split(str3,"e")
## [[1]]
## [1] "o mapa nao " " o t"        "rritorio"
str_split(str3,"e",n=2)
## [[1]]
## [1] "o mapa nao "   " o territorio"
str_sub(str3,1,3)                  # extract substrings
## [1] "o m"
str_sub(str3,seq(1,24,2),seq(2,25,2)) 
##  [1] "o " "ma" "pa" " n" "ao" " e" " o" " t" "er" "ri" "to" "ri"
str4 <- "BBCDEF"
str_sub(str4, 1, 1) <- "A"
str4
## [1] "ABCDEF"
str_sub(str4, -1, -1) <- "K"
str4
## [1] "ABCDEK"
strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
  "387 287 6718", "apple", "233.398.9187  ", "482 952 3315",
  "239 923 8115", "842 566 4692", "Work: 579-499-7527", "$1000",
  "Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_extract(strings, phone)
##  [1] "219 733 8965" "329-293-8753" NA             "595 794 7569"
##  [5] "387 287 6718" NA             "233.398.9187" "482 952 3315"
##  [9] "239 923 8115" "842 566 4692" "579-499-7527" NA            
## [13] "543.355.3679"
str_match(strings, phone)
##       [,1]           [,2]  [,3]  [,4]  
##  [1,] "219 733 8965" "219" "733" "8965"
##  [2,] "329-293-8753" "329" "293" "8753"
##  [3,] NA             NA    NA    NA    
##  [4,] "595 794 7569" "595" "794" "7569"
##  [5,] "387 287 6718" "387" "287" "6718"
##  [6,] NA             NA    NA    NA    
##  [7,] "233.398.9187" "233" "398" "9187"
##  [8,] "482 952 3315" "482" "952" "3315"
##  [9,] "239 923 8115" "239" "923" "8115"
## [10,] "842 566 4692" "842" "566" "4692"
## [11,] "579-499-7527" "579" "499" "7527"
## [12,] NA             NA    NA    NA    
## [13,] "543.355.3679" "543" "355" "3679"
rbind(
  str_pad("hadley", 10, "left"),
  str_pad("hadley", 10, "right"),
  str_pad("hadley", 10, "both")
)
##      [,1]        
## [1,] "    hadley"
## [2,] "hadley    "
## [3,] "  hadley  "
thanks_path <- file.path(R.home("doc"), "THANKS")
thanks <- str_c(readLines(thanks_path), collapse = "\n")
thanks <- word(thanks, 1, 3, fixed("\n\n"))
cat(str_wrap(thanks), "\n")
## R would not be what it is today without the invaluable help of these people,
## who contributed by donating code, bug fixes and documentation: Valerio Aimale,
## Thomas Baier, Henrik Bengtsson, Roger Bivand, Ben Bolker, David Brahm, G"oran
## Brostr"om, Patrick Burns, Vince Carey, Saikat DebRoy, Matt Dowle, Brian D'Urso,
## Lyndon Drake, Dirk Eddelbuettel, Claus Ekstrom, Sebastian Fischmeister, John
## Fox, Paul Gilbert, Yu Gong, Gabor Grothendieck, Frank E Harrell Jr, Torsten
## Hothorn, Robert King, Kjetil Kjernsmo, Roger Koenker, Philippe Lambert, Jan
## de Leeuw, Jim Lindsey, Patrick Lindsey, Catherine Loader, Gordon Maclean, John
## Maindonald, David Meyer, Ei-ji Nakama, Jens Oehlschaegel, Steve Oncley, Richard
## O'Keefe, Hubert Palme, Roger D. Peng, Jose' C. Pinheiro, Tony Plate, Anthony
## Rossini, Jonathan Rougier, Petr Savicky, Guenther Sawitzki, Marc Schwartz, Arun
## Srinivasan, Detlef Steuer, Bill Simpson, Gordon Smyth, Adrian Trapletti, Terry
## Therneau, Rolf Turner, Bill Venables, Gregory R. Warnes, Andreas Weingessel,
## Morten Welinder, James Wettenhall, Simon Wood, and Achim Zeileis. Others have
## written code that has been adopted by R and is acknowledged in the code files,
## including
cat(str_wrap(thanks, width = 40), "\n")
## R would not be what it is today
## without the invaluable help of these
## people, who contributed by donating
## code, bug fixes and documentation:
## Valerio Aimale, Thomas Baier, Henrik
## Bengtsson, Roger Bivand, Ben Bolker,
## David Brahm, G"oran Brostr"om, Patrick
## Burns, Vince Carey, Saikat DebRoy,
## Matt Dowle, Brian D'Urso, Lyndon Drake,
## Dirk Eddelbuettel, Claus Ekstrom,
## Sebastian Fischmeister, John Fox, Paul
## Gilbert, Yu Gong, Gabor Grothendieck,
## Frank E Harrell Jr, Torsten Hothorn,
## Robert King, Kjetil Kjernsmo, Roger
## Koenker, Philippe Lambert, Jan de
## Leeuw, Jim Lindsey, Patrick Lindsey,
## Catherine Loader, Gordon Maclean, John
## Maindonald, David Meyer, Ei-ji Nakama,
## Jens Oehlschaegel, Steve Oncley, Richard
## O'Keefe, Hubert Palme, Roger D. Peng,
## Jose' C. Pinheiro, Tony Plate, Anthony
## Rossini, Jonathan Rougier, Petr Savicky,
## Guenther Sawitzki, Marc Schwartz, Arun
## Srinivasan, Detlef Steuer, Bill Simpson,
## Gordon Smyth, Adrian Trapletti, Terry
## Therneau, Rolf Turner, Bill Venables,
## Gregory R. Warnes, Andreas Weingessel,
## Morten Welinder, James Wettenhall, Simon
## Wood, and Achim Zeileis. Others have
## written code that has been adopted by R
## and is acknowledged in the code files,
## including
cat(str_wrap(thanks, width = 60, indent = 2), "\n")
##   R would not be what it is today without the invaluable
## help of these people, who contributed by donating code,
## bug fixes and documentation: Valerio Aimale, Thomas Baier,
## Henrik Bengtsson, Roger Bivand, Ben Bolker, David Brahm,
## G"oran Brostr"om, Patrick Burns, Vince Carey, Saikat
## DebRoy, Matt Dowle, Brian D'Urso, Lyndon Drake, Dirk
## Eddelbuettel, Claus Ekstrom, Sebastian Fischmeister, John
## Fox, Paul Gilbert, Yu Gong, Gabor Grothendieck, Frank E
## Harrell Jr, Torsten Hothorn, Robert King, Kjetil Kjernsmo,
## Roger Koenker, Philippe Lambert, Jan de Leeuw, Jim Lindsey,
## Patrick Lindsey, Catherine Loader, Gordon Maclean, John
## Maindonald, David Meyer, Ei-ji Nakama, Jens Oehlschaegel,
## Steve Oncley, Richard O'Keefe, Hubert Palme, Roger D. Peng,
## Jose' C. Pinheiro, Tony Plate, Anthony Rossini, Jonathan
## Rougier, Petr Savicky, Guenther Sawitzki, Marc Schwartz,
## Arun Srinivasan, Detlef Steuer, Bill Simpson, Gordon
## Smyth, Adrian Trapletti, Terry Therneau, Rolf Turner, Bill
## Venables, Gregory R. Warnes, Andreas Weingessel, Morten
## Welinder, James Wettenhall, Simon Wood, and Achim Zeileis.
## Others have written code that has been adopted by R and is
## acknowledged in the code files, including
cat(str_wrap(thanks, width = 60, exdent = 2), "\n")
## R would not be what it is today without the invaluable
##   help of these people, who contributed by donating code,
##   bug fixes and documentation: Valerio Aimale, Thomas Baier,
##   Henrik Bengtsson, Roger Bivand, Ben Bolker, David Brahm,
##   G"oran Brostr"om, Patrick Burns, Vince Carey, Saikat
##   DebRoy, Matt Dowle, Brian D'Urso, Lyndon Drake, Dirk
##   Eddelbuettel, Claus Ekstrom, Sebastian Fischmeister, John
##   Fox, Paul Gilbert, Yu Gong, Gabor Grothendieck, Frank E
##   Harrell Jr, Torsten Hothorn, Robert King, Kjetil Kjernsmo,
##   Roger Koenker, Philippe Lambert, Jan de Leeuw, Jim Lindsey,
##   Patrick Lindsey, Catherine Loader, Gordon Maclean, John
##   Maindonald, David Meyer, Ei-ji Nakama, Jens Oehlschaegel,
##   Steve Oncley, Richard O'Keefe, Hubert Palme, Roger D. Peng,
##   Jose' C. Pinheiro, Tony Plate, Anthony Rossini, Jonathan
##   Rougier, Petr Savicky, Guenther Sawitzki, Marc Schwartz,
##   Arun Srinivasan, Detlef Steuer, Bill Simpson, Gordon
##   Smyth, Adrian Trapletti, Terry Therneau, Rolf Turner, Bill
##   Venables, Gregory R. Warnes, Andreas Weingessel, Morten
##   Welinder, James Wettenhall, Simon Wood, and Achim Zeileis.
##   Others have written code that has been adopted by R and is
##   acknowledged in the code files, including
sentences <- c("Jane saw a cat", "Jane sat down")
word(sentences, 1)               # Extract words from a sentence.
## [1] "Jane" "Jane"
word(sentences, 2)
## [1] "saw" "sat"
word(sentences, -1)
## [1] "cat"  "down"
word(sentences, 2, -1)
## [1] "saw a cat" "sat down"
word(sentences[1], 1:3, -1)      # Also vectorised over start and end
## [1] "Jane saw a cat" "saw a cat"      "a cat"
word(sentences[1], 1, 1:4)
## [1] "Jane"           "Jane saw"       "Jane saw a"     "Jane saw a cat"
str <- 'abc.def..123.4568.999'
word(str, 1, sep = fixed('..'))  # Can define words by other separators
## [1] "abc.def"
word(str, 2, sep = fixed('..'))
## [1] "123.4568.999"

Package rex

Package rex helps construct complex regexes at a higher level of abstraction.

library(rex)
## 
## Attaching package: 'rex'
## The following object is masked from 'package:stringr':
## 
##     regex
strings <- c("test", "a test", "abc")

strings %>% re_matches( rex("t", zero_or_more(".")) )
## [1]  TRUE  TRUE FALSE
# If there are captures in the regular expression, returns a data.frame with 
# a column for each capture group.
strings %>% re_matches( rex(capture("t"), zero_or_more(".")) )
##      1
## 1    t
## 2    t
## 3 <NA>
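
Captures can also be named (the log-parsing example at the end of this page relies on this), which names the columns of the resulting data.frame; a minimal sketch:

strings %>% re_matches( rex(capture(name = "letter", "t"), zero_or_more(".")) )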

rex() returns regex objects that can be reused afterwards:

re <- rex(
  capture(
    group("a" %or% "b"),
    one_or_more(non_spaces)
  )
)

strings %>% re_matches(re)
##      1
## 1 <NA>
## 2 <NA>
## 3  abc

Here’s a regex for URL matching from the package’s vignette:

valid_chars <- rex(except_some_of(".", "/", " ", "-"))

re <- rex(
  start,

  # protocol identifier (optional) + //
  group(list('http', maybe('s')) %or% 'ftp', '://'),

  # user:pass authentication (optional)
  maybe(non_spaces,
    maybe(':', zero_or_more(non_space)),
    '@'),

  # host name
  group(zero_or_more(valid_chars, zero_or_more('-')), one_or_more(valid_chars)),

  # domain name
  zero_or_more('.', zero_or_more(valid_chars, zero_or_more('-')), one_or_more(valid_chars)),

  # TLD identifier
  group('.', valid_chars %>% at_least(2)),

  # server port number (optional)
  maybe(':', digit %>% between(2, 5)),

  # resource path (optional)
  maybe('/', non_space %>% zero_or_more()),

  end
)
good <- c(
  "http://foo.com/blah_blah/",
  "http://foo.com/blah_blah_(wikipedia)",
  "http://www.example.com/wpstyle/?p=364",
  "http://1337.net",
  "http://a.b-c.de",
  "http://223.255.255.254")

bad <- c(
  "http://",
  "http://.",
  "http://..",
  "http://../",
  "http://?")

all(grepl(re, good) == TRUE)
## [1] TRUE
all(grepl(re, bad) == FALSE)
## [1] TRUE

And here is some server-log parsing from another vignette:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:rex':
## 
##     escape
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
logs <- c(
  '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245', 
  'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
  '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085',
  'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
  '199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179'
)

logs %>%
  re_matches(
    rex(

      # Get the time of the request
      "[",
        capture(name = "time",
          except_any_of("]")
        ),
      "]",

      space, double_quote, "GET", space,

      # Get the filetype of the request if requesting a file
      maybe(
        non_spaces, ".",
        capture(name = 'filetype',
          except_some_of(space, ".", "?", double_quote)
        )
      )
    )
  ) %>%
  mutate(filetype = tolower(filetype),
         time = as.POSIXct(time, format="%d/%b/%Y:%H:%M:%S %z"))
##                  time filetype
## 1 1995-07-01 05:00:01         
## 2 1995-07-01 05:00:06         
## 3 1995-07-01 05:00:09     html
## 4 1995-07-01 05:00:11     html
## 5 1995-07-01 05:00:11      gif