Working With Strings

Common formulas

FILTERXML, TEXTJOIN, CONCAT, SUBSTITUTE, LEFT, MID, RIGHT, LEN, SEARCH, TRIM, PROPER, UPPER, LOWER, ISNUMBER

Scrubbing and Standardizing Data

Two of the biggest issues when compiling data from multiple sources are a lack of uniformity and the cramming together of what should be multiple different fields. In general, it is best to break up data into their smallest pieces of information (e.g. {“123”}, {“Main St”}, {“Springfield”}, {“MA”}, {“01001”} instead of {“123 Main St, Springfield MA 01001”}).

The below image gives a simple example of poor data and basic techniques to improve it.

For thoroughness, we expand on the formula for the “Last” column. We have:

=SUBSTITUTE(
TRIM(PROPER(
RIGHT(B3, LEN(B3)-SEARCH(" ", B3)))),
", Phd.", "")

Working from the inside out, LEN and SEARCH work together to get the number of characters after the first space in the string.

B3="Jenny     TUTONE"      #example#
=LEN(B3)                            #returns 16#
=SEARCH(" ", B3)                #returns 6#

=LEN(B3)-SEARCH(" ", B3)) #returns 10#

Next, RIGHT is used to grab specific characters in a string

B3="Jenny     TUTONE"        #example#
=LEFT(B3, 5)                         #returns "Jenny"#
=RIGHT(B3, 6)                      #returns "TUTONE"#
=MID(B3, 2, 3)                     #returns "enn"#

=RIGHT(B3, LEN(B3)-SEARCH(" ", B3))  #returns "   TUTONE"#

After that, one can use TRIM to eliminate preceding spaces in a string, and use PROPER to start a string with an uppercase letter and end it with lowercase ones.

B3="     tUtONe"   #example#
=TRIM(B3)             #returns "tUtONe"#
=PROPER(B3)        #"     Tutone"#
#based on the above image, the below return is "Tutone"#

=TRIM(PROPER(RIGHT(B3, LEN(B3)-SEARCH(" ", B3))))

Finally, one can use SUBSTITUTE to substitute parts of string out for different characters, and add on characters with the & sign as needed (always wrapping additions in quotes).

B4="Euler, Phd."                            #example#
=SUBSTITUTE(B4, ", Phd.", "")       #returns "Euler"#
=SUBSTITUTE(B4, ", Msc", "")        #returns "Euler, Phd."#
=SUBSTITUTE(B4, ", Phd.", ", Msc")  #returns "Euler, Msc"#
="Mr."&B4                           #returns "Mr. Euler, Phd."#

Reorder Strings

One of the most useful functions for working with strings is FILTERXML. By putting strings in valid XML format, one can extract parts of a larger string. The below image gives an example of how FILTERXML can be used to reorder substrings.

Working from the inside out, we first use the SUBSTITUTE function described above.

V2=100-123-ABC
=SUBSTITUTE(V2, "-", "</s><s>")    #returns 100</s><s>123</s><s>ABC#

Next we add xpath tags with the “&” operator to finish putting the string in XML format for use in FILTERXML. Note that each parent/child tag does not need to be denoted “t” and “s”; other characters or strings are also valid.

A1="<t>                        #example#
<s>100</s>
<s>123</s>
<s>ABC</s>
</t>"

=FILTERXML(A1, "//s")    #returns {100}, {123}, {"ABC"}#

After breaking the string into parts, we can INDEX the resultant dynamic array to reorder the substrings.

A1=100                           #example#
A2=123                    
A3=ABC                    

=INDEX(A1:A3, {1,3,2})    #returns {100}, {ABC}, {123}#

After INDEXing, we still have a dynamic array. We can join each cell together, separating the cells with a given delimiter (in this case “-“, but most anything is valid), using TEXTJOIN. Note that the second argument in the function (the 1) signifies that the function ignores any empty cells in the given range.

A1=100                                #example#
A2=ABC
A3=123

=TEXTJOIN("-", 1, A1:A3)    #returns {"100-ABC-123"}#

Break Strings Into Parts

Combining FILTERXML with LET can be especially useful. The below image gives an example of how the two can be used to break a long string into more digestible parts. The example is dialogue between a user and agent.

The essence of the LET function is that it enables users to name variables for later use. This improves readability and allows users to perform multiple actions in one go rather than use helper columns. Note that variable names can be most anything besides valid cell references.

Starting with FILTERXML as shown in the above section, we have:

V1="I'm looking to get some help. Agent: Sure thing, what can I do for you?"
V2="Well, I ordered a package."     #trimmed example#



=LET(x, V1:V2)                      #stores the range as a variable named "x"#

Next, we would like to remove any parts of the string of the agent’s reply. To do so, we first find the character number that the agent’s reply starts on, accounting for cases where there is no agent reply by using error handling functions.

V1="I'm looking to get some help. Agent: Sure thing, what can I do for you?"
V2="Well, I ordered a package."                  #trimmed example#

=SEARCH("Agent", V1)-2                             #returns 29#
=SEARCH("Agent", V2)-2                             #returns NA#
=IFERROR(SEARCH("Agent", V1:V2)-2, LEN(V1:V2))   #returns {29}, {26}#

We now have two dynamic ranges “x” and “y”, the first of which was used in the creation of the second. We proceed with our short-term goal of removing the agent’s reply by using the familiar LEFT function.

V1="I'm looking to get some help. Agent: Sure thing, what can I do for you?"
V2="Well, I ordered a package."  #trimmed example#

="User: "&LEFT(V1, 29)           #returns "User: I'm looking to get some help."#
="User: "&LEFT(V2,26)            #returns "User: Well, I ordered a package."#

Notice how the use of LET greatly simplifies the formula:

=LET(x, FILTERXML("<t><s>"&SUBSTITUTE(V1, "User:", "</s><s>")&"</s></t>", "//s"),
y, IFERROR(SEARCH("Agent", x)-2, LEN(x)),
"User: "&LEFT(x,y))

vs. 

="User: "&
LEFT(FILTERXML("<t><s>"&SUBSTITUTE(V1, "User:", "</s><s>")&"</s></t>", "//s"),
IFERROR(SEARCH("Agent", 
FILTERXML("<t><s>"&SUBSTITUTE(V1, "User:", "</s><s>")&"</s></t>", "//s"))-2, 
LEN(FILTERXML("<t><s>"&SUBSTITUTE(V1, "User:", "</s><s>")&"</s></t>", "//s")))

Extract Parts Of Strings

When a column/row of strings follow similar formats, it can be useful to extract smaller pieces of information to maintain best practice. In the example below, we see a column of strings that give trade confirmations. We would like to extract pieces of information from each string (Buy/Sell, Quantity, Ticker, Instrument , Expiration Date, Strike Price, Premium, etc.), so for illustration will extract the premium and organize it into columns representing different instrument and Buy/Sell combinations.

Our strategy is to first see if the instrument for the trade was a put or a call option, and then categorize the premium based on if the trade bought or sold the instrument. To prevent having to enter different formulas in different columns, we aim to just drag the formula in one column, and have the results dynamically fill into the others.

To do so, we first test if the string contains the word “Call” using SEARCH.

A1="Buy 1 HD 9/3 325 Call @2.75"                  #example#

=IF(ISNUMBER(SEARCH("Call", A1)), "A", "B")   #returns "A"#
=IF(ISNUMBER(SEARCH("Put", A1)), "A", "B")    #returns "B"#

If the string does contain “Call”, we then test if it contains “Bought” or “Sold”. To avoid repeating arguments, we can test both as an array by using brackets ({}).

A1="Buy 1 HD 9/3 325 Call @2.75"                                 #example#

=IF(ISNUMBER(SEARCH({"Buy", "Sell"}, A1)), "A", "B")      #returns {"A", "B"}#

If the condition is met, we continue by extracting the premium using similar methods as described above with RIGHT, LEN, and SEARCH. If the condition isn’t met (the string doesn’t contain “Bought” or “Sold”), the return is blank. Note that in each string, exactly one condition will be met, and that to ensure that the premium is not stored as text, we save it as a number using NUMBERVALUE.

A1="Buy 1 HD 9/3 325 Call @2.75"                                                #example#

=IF(ISNUMBER(SEARCH({"Buy", "Sell"}, A1)), 
NUMBERVALUE(RIGHT($A1, LEN($A1)-SEARCH("@", $A1))), "")    #returns {2.75, ""}#

Resolving the first IF statement, we address the case where the string doesn’t contain “Call”, and use similar methods as described above to extract the premium. The key difference is extending the array that is being tested so that the formula can be dragged in just one column instead of many. To prevent a return in the columns reserved for strings containing “Call”, we search for strings that are unlikely to appear in the string being tested. Something like “!!!” is a good choice.

A2="Sell 1 AI 9/17 50 Put @1.5"                                                  #example#

=IF(ISNUMBER(SEARCH({"!!!", "!!!", "Buy", "Sell"}, A2)), 
NUMBERVALUE(RIGHT($A2, LEN($A2)-SEARCH("@", $A2))), "")  #returns {"","","",1.5}#

Putting it all together we get

=IF(ISNUMBER(SEARCH("Call", A1)),

IF(ISNUMBER(SEARCH({"Buy", "Sell"}, A1)), 
NUMBERVALUE(RIGHT($A1, LEN($A1)-SEARCH("@", $A1))), ""),

IF(ISNUMBER(SEARCH({"!!!", "!!!", "Buy", "Sell"}, A1)), 
NUMBERVALUE(RIGHT($A1, LEN($A1)-SEARCH("@", $A1))), ""))  

Another example of extracting information is shown below. In this case, we use the FILTERXML described earlier, and only return strings that are an integer greater than 36.

Frequently Used Words

While Excel is not a normal candidate to deal with word frequency, it is possible to do some rudimentary analysis using some of the formulas which we’ve already seen. In the example below, we are given a column of phrases and would like to return the five most used words in the column.

String Matching