
StegosauR conceals a message in a relatively straightforward way. Let’s take a sample text:

txt <- "This is StegosauR"

Since messages can only be stored in numeric format, the first step consists of converting the given combination of letters and symbols into a sequence of numbers. The first part of this task is to convert every character of the message into its Unicode code point. Our sample text then becomes:

library(Unicode)
txt.u <- as.u_char(utf8ToInt(txt))

txt.u
##  [1] U+0054 U+0068 U+0069 U+0073 U+0020 U+0069 U+0073 U+0020 U+0053 U+0074 U+0065 U+0067 U+006F
## [14] U+0073 U+0061 U+0075 U+0052
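
The same code points can also be inspected without the Unicode package. The following is a minimal sketch in base R, assuming only `utf8ToInt()` and `sprintf()`; the variable names `cp` and `cp.hex` are illustrative and not part of StegosauR:

```r
txt <- "This is StegosauR"

# utf8ToInt() returns one integer code point per character
cp <- utf8ToInt(txt)

# format each code point as "U+" followed by at least four hexadecimal digits
cp.hex <- sprintf("U+%04X", cp)

head(cp.hex, 4)
## [1] "U+0054" "U+0068" "U+0069" "U+0073"
```

This only reproduces the printed form; the rest of the walkthrough keeps using the `u_char` object created above.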

The advantage of code points is that they are written in hexadecimal. They still contain letters, but only six of them (A to F) alongside the digits 0 to 9, so there are just sixteen symbols to deal with. Now we can remove the “U+” portion of each code point and then replace every hexadecimal digit with a unique two-digit number.

#remove "U+"
txt.u <- sub("U\\+", "", txt.u)

#join everything into a single string, then split it into individual characters
txt.u <- paste(txt.u,collapse="")

txt.u <- strsplit(txt.u, "")[[1L]]

#define what will replace what and proceed with the substitution.
#code for unlist substitution taken from: https://stackoverflow.com/questions/7547597/dictionary-style-replace-multiple-items

map <- setNames(c("22", "23","24","25","32","33","34","35","42","43","44","45","52","53","54","55"),
                c("0", "1", "2","3","4","5","6","7","8","9","A","B","C","D","E","F"))

#look up the two-digit code of every hexadecimal digit
#(vectorised lookup, so no container has to be grown inside a loop)
v <- as.numeric(map[txt.u])

as.vector(v)
##  [1] 22 22 33 32 22 22 34 42 22 22 34 43 22 22 35 25 22 22 24 22 22 22 34 43 22 22 35 25 22 22
## [31] 24 22 22 22 33 25 22 22 35 32 22 22 34 33 22 22 34 35 22 22 34 55 22 22 35 25 22 22 34 23
## [61] 22 22 35 33 22 22 33 24
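
To see why every hexadecimal digit in the map above receives a code of the same width, it helps to compare it with a variable-width alternative. The sketch below uses a hypothetical map called `naive` (not something StegosauR uses) that replaces each hexadecimal digit with its plain decimal value, so that "A" becomes "10":

```r
# hypothetical variable-width map: hex digit -> decimal value ("A" becomes "10")
naive <- setNames(as.character(0:15), c(0:9, LETTERS[1:6]))

# two different hexadecimal sequences ...
a <- c("1", "0")
b <- "A"

# ... collapse to the same digit string, so decoding would be ambiguous
paste(naive[a], collapse = "")   # "10"
paste(naive[b], collapse = "")   # "10"

# the fixed-width map from the text keeps them distinct
map <- setNames(c("22", "23", "24", "25", "32", "33", "34", "35",
                  "42", "43", "44", "45", "52", "53", "54", "55"),
                c(0:9, LETTERS[1:6]))
paste(map[a], collapse = "")     # "2322"
paste(map[b], collapse = "")     # "44"
```

With the naive map there is no way to tell whether "10" stands for one symbol or two; with two digits per symbol the boundaries are always known.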

Consistency is extremely important here: all elements must always have the same number of digits. The elements can be merged into a single number or split into individual digits; either way, knowing that they all have the same length allows us to rebuild the original message structure. For example:

v.collapsed <- paste(as.vector(v), collapse="")

v.collapsed
## [1] "2222333222223442222234432222352522222422222234432222352522222422222233252222353222223433222234352222345522223525222234232222353322223324"

This long string of digits can be converted back to the original message because we know that every hexadecimal digit is represented by two decimal digits, and every code point has four hexadecimal digits.

#split in groups of two digits
starts <- seq(1, nchar(v.collapsed), by = 2)
v.rebuilt <- substring(v.collapsed, starts, starts + 1)

as.vector(v.rebuilt)
##  [1] "22" "22" "33" "32" "22" "22" "34" "42" "22" "22" "34" "43" "22" "22" "35" "25" "22" "22"
## [19] "24" "22" "22" "22" "34" "43" "22" "22" "35" "25" "22" "22" "24" "22" "22" "22" "33" "25"
## [37] "22" "22" "35" "32" "22" "22" "34" "33" "22" "22" "34" "35" "22" "22" "34" "55" "22" "22"
## [55] "35" "25" "22" "22" "34" "23" "22" "22" "35" "33" "22" "22" "33" "24"

#produce a data frame with four columns
df.v <- as.data.frame(matrix(v.rebuilt, nrow = length(v.rebuilt)/4, ncol = 4, byrow=TRUE))

df.v
##    V1 V2 V3 V4
## 1  22 22 33 32
## 2  22 22 34 42
## 3  22 22 34 43
## 4  22 22 35 25
## 5  22 22 24 22
## 6  22 22 34 43
## 7  22 22 35 25
## 8  22 22 24 22
## 9  22 22 33 25
## 10 22 22 35 32
## 11 22 22 34 33
## 12 22 22 34 35
## 13 22 22 34 55
## 14 22 22 35 25
## 15 22 22 34 23
## 16 22 22 35 33
## 17 22 22 33 24

#substitute the two-digit elements with the original hexadecimal values
map <- setNames(c("0", "1", "2","3","4","5","6","7","8","9","A","B","C","D","E","F"),
                c("22", "23","24","25","32","33","34","35","42","43","44","45","52","53","54","55"))

df.v[] <- map[as.vector(unlist(df.v))]

df.v
##    V1 V2 V3 V4
## 1   0  0  5  4
## 2   0  0  6  8
## 3   0  0  6  9
## 4   0  0  7  3
## 5   0  0  2  0
## 6   0  0  6  9
## 7   0  0  7  3
## 8   0  0  2  0
## 9   0  0  5  3
## 10  0  0  7  4
## 11  0  0  6  5
## 12  0  0  6  7
## 13  0  0  6  F
## 14  0  0  7  3
## 15  0  0  6  1
## 16  0  0  7  5
## 17  0  0  5  2

#convert the hexadecimal values to Unicode code points
codes <- as.u_char(paste(df.v$V1, df.v$V2, df.v$V3, df.v$V4, sep=""))

codes
##  [1] U+0054 U+0068 U+0069 U+0073 U+0020 U+0069 U+0073 U+0020 U+0053 U+0074 U+0065 U+0067 U+006F
## [14] U+0073 U+0061 U+0075 U+0052

#restore the original message
paste(sapply(codes, intToUtf8), collapse="")
## [1] "This is StegosauR"

This is just a proof of concept. StegosauR works a bit differently. For instance, Unicode includes 1,114,112 potential code points, ranging from U+0000 to U+10FFFF. This means that converting a text to code points could result in a mix of four-, five- and six-digit hexadecimal values. Since consistency is key, StegosauR converts all code points into a six-digit format. Then every digit of the six-digit code is converted into a two-digit value as shown above, so that each character in our message is eventually represented by a twelve-digit number. In the example below, the four-digit code point of “R” yields eight digits, and adding 999900000000 brings the result to twelve digits:

letter <- "R"

letter.u <- as.u_char(utf8ToInt(letter))

letter.u <- sub("U\\+", "", letter.u)

letter.u
## [1] "0052"

map <- setNames(c("22", "23","24","25","32","33","34","35","42","43","44","45","52","53","54","55"),
                c("0", "1", "2","3","4","5","6","7","8","9","A","B","C","D","E","F"))

letter.u <- strsplit(letter.u, "")[[1L]]

letter.u <- map[letter.u]

letter.u <- paste(letter.u,collapse="")

#adding 999900000000 prefixes the eight-digit value with 9999, giving twelve digits
letter.u <- as.numeric(letter.u) + 999900000000

letter.u
## [1] 999922223324

nchar(letter.u)
## [1] 12
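
The idea of a fixed six-digit code point per character can be sketched in base R as follows. This is an illustration under stated assumptions and not StegosauR's actual code: the helper names `encode_chars()` and `decode_chars()` are hypothetical, and code points are padded with leading zeros via `sprintf("%06X", ...)` rather than with the 9999 prefix used in the example above.

```r
hex.digits <- c(0:9, LETTERS[1:6])
pairs <- c("22", "23", "24", "25", "32", "33", "34", "35",
           "42", "43", "44", "45", "52", "53", "54", "55")
to.pair <- setNames(pairs, hex.digits)   # hex digit -> two-digit code
to.hex  <- setNames(hex.digits, pairs)   # two-digit code -> hex digit

# hypothetical helper: one twelve-digit string per character of the message
encode_chars <- function(txt) {
  hex <- sprintf("%06X", utf8ToInt(txt))   # zero-pad to six hex digits
  vapply(strsplit(hex, ""),
         function(h) paste(to.pair[h], collapse = ""),
         character(1))
}

# hypothetical helper: invert encode_chars()
decode_chars <- function(codes) {
  starts <- seq(1, 11, by = 2)             # start of each two-digit group
  hex <- vapply(codes, function(s) {
    paste(to.hex[substring(s, starts, starts + 1)], collapse = "")
  }, character(1))
  intToUtf8(strtoi(hex, base = 16L))
}

enc <- encode_chars("R")
enc   # "222222223324"

# round trip, including a code point above U+FFFF
msg <- "This is StegosauR \U0001F995"
decode_chars(encode_chars(msg)) == msg
```

Keeping each code as a character string instead of a number avoids any concern about leading digits or numeric precision when many codes are joined together.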