GRPH: Overlap Graphs | Ben Cunningham

GRPH: Overlap Graphs

Problem by Rosalind · on July 1, 2012

A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.

A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail and head is represented by (but not by ). A directed loop is a directed edge of the form .

For a collection of strings and a positive integer , the overlap graph for the strings is a directed graph in which each string is represented by a node, and string is connected to string with a directed edge when there is a length suffix of that matches a length prefix of , as long as ; we demand to prevent directed loops in the overlap graph (although directed cycles may be present).

Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.

Return: The adjacency list corresponding to . You may return edges in any order.

Sample Dataset

>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG

Sample Output

Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323

R

library(dplyr)
library(seqinr)
library(stringr)
library(tidyr)

f <- "grph.txt"
raw <- unlist(read.fasta(f, as.string = TRUE))

dna <- data_frame(
  id = names(raw),
  x = toupper(raw),
  beg = str_extract(x, "^.{3}"),
  end = str_extract(x, ".{3}$")
)

adj <-
  dna %>%
  inner_join(dna, by = c("end" = "beg")) %>%
  filter(id.x != id.y) %>%
  mutate(out = paste(id.x, id.y))
  
cat(adj$out, sep = "\n")
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323