Regular Expression Cheat Sheet

Kan_Nishida · April 14, 2017, 4:01pm

Here is a great cheat sheet for regular expressions in R, from the folks at RStudio.

If you have any questions about using RegEx, feel free to ask!

pgensler · April 15, 2017, 10:19pm

This book is a great reference for regular expressions, as is the stringr site (which I wish I had known about earlier):

http://stringr.tidyverse.org/articles/regular-expressions.html

Alan_Ponce · April 15, 2017, 11:31pm

Hi

I am stuck with some PDFs that I had to convert to text, and they lost their tabular format.

The data (which is a .txt file) has this structure:

Question A
This is the text that I want to extract.

Question B
This is the text in answer B.

Question N
Answer N

I am wondering how to extract the text between questions, e.g. between A and B, and then reshape it into a tabular format. For instance:

Question   | Answer
Question A | This is the text that I want to extract
Question B | This is the answer B
Question N | Answer N

Any clues?

Many thanks in advance.

Alan

Kan_Nishida · April 17, 2017, 2:54am

Hi Alan, can you send me the text file so that I can look into it?

pgensler · April 22, 2017, 9:10pm

@Alan_Ponce I think something like this might be what you are after. I had a similar situation trying to analyze a large txt file, and I think this approach works pretty well.
See this:

Assuming your data has two lines for each chunk, I think something like this might work for you if you had a delimiter between Question A and your text:

read_lines from readr reads your text into a large character vector, one element per line, which makes it easy to parse.
(\s) matches the whitespace that separates the two parts of each line and acts as the delimiter (inside an R string it has to be written as "\\s"), and you can spread your data into columns based on that delimiter:

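> library(dplyr)   # tibble(), filter(), mutate(), %>%
> library(readr)   # read_lines()
> library(tidyr)   # separate(), spread()
>
> # x: path to the text file; lines_per_chunk: non-empty lines per record
> # (13 in the original snippet; adjust it to your file).
> ReadChunkedFile <- function(x, lines_per_chunk = 13) {
>   tibble(text = read_lines(x)) %>%
>     filter(text != "") %>%
>     # Split each line at the first whitespace: key in "var", rest in "value".
>     separate(text, c("var", "value"), "\\s", extra = "merge") %>%
>     mutate(
>       chunk_id = rep(seq_len(n() / lines_per_chunk), each = lines_per_chunk),
>       value    = trimws(value)
>     ) %>%
>     spread(var, value)
> }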
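
To see what the separate() step does on its own, here is a minimal sketch on two made-up lines (the sample data and the names lines and parts are just for illustration, not from anyone's real file). It assumes tidyr is installed:

```r
library(tidyr)

# Hypothetical sample lines, each of the form "<key> <value text>"
lines <- data.frame(
  text = c("name John Smith", "city New York"),
  stringsAsFactors = FALSE
)

# "\\s" is the R string for the regex \s (a single whitespace character).
# extra = "merge" splits only at the first match, so the rest of the line
# stays together in "value".
parts <- separate(lines, text, c("var", "value"), sep = "\\s", extra = "merge")
print(parts)
```

Without extra = "merge", separate() would split at every whitespace match and discard the pieces that do not fit into the two columns (with a warning), so multi-word values would be cut short.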

Now you can call your function to read in the data, where dataset is the file path to your data:
df <- ReadChunkedFile(dataset)
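
For the layout from the question earlier in the thread (a "Question X" line followed by its answer text), a different route might be simpler. This is only a sketch in base R, assuming every question line starts with the word "Question" and everything up to the next question line is its answer. The vector txt stands in for the lines of the file, and the names is_question, group, answers and qa are made up for the example:

```r
# Sample text in the layout described in the question (stand-in for the file)
txt <- c(
  "Question A",
  "This is the text that I want to extract.",
  "",
  "Question B",
  "This is the text in answer B.",
  "",
  "Question N",
  "Answer N"
)

# Trim whitespace and drop the empty lines
lines <- trimws(txt)
lines <- lines[lines != ""]

# A line is a question header if it starts with "Question" plus whitespace
is_question <- grepl("^Question\\s", lines)

# Each header starts a new group; the lines after it share its group number
group <- cumsum(is_question)

# Paste the non-header lines of each group into one answer string
# (a question without any answer lines gets an empty string)
answers <- vapply(
  seq_len(sum(is_question)),
  function(i) paste(lines[group == i & !is_question], collapse = " "),
  character(1)
)

qa <- data.frame(
  Question = lines[is_question],
  Answer   = answers,
  stringsAsFactors = FALSE
)
print(qa)
```

With a real file you would replace txt with the result of readLines() or readr::read_lines() on your file path. If the question lines follow a different pattern, only the regular expression in grepl() needs to change.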
@Kan_Nishida It might be worthwhile to have a tutorial on how to do this in Exploratory (using the tidyr functions available there)?