Further Evasion in the Forgotten Corners of MS-XLS

It’s been a few weeks since my last discussion1 of Excel 4.0 macro shenanigans and the space continues to change. LastLine published a great report2 which summarized the progression of weaponized macros from February through May. The good folks at InQuest have continued3 identifying4 malicious5 macro documents6. @DissectMalware‘s excellent XLMMacroDeobfuscator7 has massively expanded its range of macro emulation, and FortyNorth Security released EXCELntDonut8, a tool for converting Donut9 shellcode into multi-architecture Excel 4.0 macros.

Over the past few weeks I’ve also started seeing some of the files generated by my tool Macrome10 begin to trigger detections on VirusTotal11. This is exactly the sort of thing I want to see – besides the fact that it implies that AV is getting better signal on this attack vector, it also provides an opportunity to improve my tool and take better guesses about what direction attackers will pivot in the future. I’m a big believer in @Mattifestation’s approach to detection engineering12 and detection from AV helps move the iterative development of tooling further along.

After realizing that some of my samples were being detected, I took several documents that had been generated during testing and submitted each of them to VirusTotal – only the larger documents appeared to be matching virus signatures. I did a quick binary search of the document sizes between what was detected on VirusTotal and what wasn’t and discovered that if a document had greater than 100 CHAR invocations, then it was considered malicious.

A “safe” document with exactly 100 =CHAR() expressions
A document that has one too many =CHAR() expressions

While my generated document had obfuscated the usage of the CHAR function, clearly there was a signature that could detect these alternate CHAR invocations. For reference, here is @DissectMalware’s macro_sheet_obfuscated_char rule13 that the generated document attempted to avoid:

rule macro_sheet_obfuscated_char
{
  meta:
    description = "Finding hidden/very-hidden macros with many CHAR functions"
    Author = "DissectMalware"
    Sample = "0e9ec7a974b87f4c16c842e648dd212f80349eecb4e636087770bc1748206c3b (Zloader)"
  strings:
    $ole_marker = {D0 CF 11 E0 A1 B1 1A E1}              
    $macro_sheet_h1 = {85 00 ?? ?? ?? ?? ?? ?? 01 01}
    $macro_sheet_h2 = {85 00 ?? ?? ?? ?? ?? ?? 02 01}    
    $char_func = {06 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 1E 3D  00 41 6F 00}
  condition:
    $ole_marker at 0 and 1 of ($macro_sheet_h*) and #char_func > 10
}

My previous blog post discussed how to break the longer signature for $char_func, but it didn’t address what to do if the signature for the CHAR function were more reliable. In this case the signature was likely only the three bytes of a PtgFunc14 invocation with the CHAR Ftab value15 (41 6F 00), repeated enough times to avoid false positives. This is likely the reason for the “high” minimum count requirement of 101+ instances versus the 11+ in the macro_sheet_obfuscated_char rule.

An obfuscated invocation of CHAR(65) that triggered results on VirusTotal after 101+ instances were used

One “quick” hack to bypass this signature is to abuse the fact that PtgFuncVar16 can be used instead of PtgFunc to invoke the CHAR function (42 01 6F 00). PtgFuncVar is largely identical to PtgFunc except for the fact that PtgFuncVar must also be provided with the number of arguments being passed into the called function. While PtgFunc is only used to call functions with a fixed number of arguments, there is nothing that stops us from invoking PtgFuncVar and providing the correct argument count. PtgFunc(CHAR) is identical to PtgFuncVar(1,CHAR).
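To make the point about the two encodings concrete, here is a minimal defender-side sketch that counts both byte patterns in a raw file. The byte values come from the discussion above; the function names and the combined threshold are hypothetical, and this is not the signature any AV vendor actually uses.

```python
# Byte patterns from the post: PtgFunc(CHAR) and PtgFuncVar(1, CHAR).
PTGFUNC_CHAR = bytes.fromhex("416F00")
PTGFUNCVAR_CHAR = bytes.fromhex("42016F00")


def count_char_invocations(data: bytes) -> dict:
    """Count both encodings of a CHAR call in a raw byte string."""
    return {
        "PtgFunc": data.count(PTGFUNC_CHAR),
        "PtgFuncVar": data.count(PTGFUNCVAR_CHAR),
    }


def looks_suspicious(data: bytes, threshold: int = 100) -> bool:
    """Hypothetical heuristic: flag when the combined count exceeds the threshold."""
    counts = count_char_invocations(data)
    return counts["PtgFunc"] + counts["PtgFuncVar"] > threshold
```

Counting both forms together is the obvious next step for a signature author, which is why swapping PtgFunc for PtgFuncVar only buys a little time.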

Hex dump of a FORMULA17 BIFF8 record using the alternate PtgFuncVar(1,CHAR) invocation

This is a nice signature evasion trick, but it ultimately is vulnerable to the same method of detection, just with a slightly different byte signature. Fundamentally, many tricks that macro sheets rely on in order to deobfuscate themselves will rely on invoking a handful of functions repeatedly. Large macro payloads can require invoking some form of CHAR and FORMULA hundreds of times – what will adversaries do once there are better signatures put into place for detecting suspiciously repeated usages of these functions?

Re-Enter the Subroutine

In normal programming, when we constantly call the same code over and over again, we write a function. Even in VBA macros, the idea of subroutines exists to allow for simple code reuse. While the Excel 4.0 Macro Functions Reference18 mentions the idea of Excel 4.0 macro subroutines several times, it never actually details how these can be created.

In practice, Excel 4.0 macro subroutines are really just a sequence of RUN and RETURN functions. A subroutine is invoked by calling the RUN function with an argument referencing the start cell of the sub-macro. Execution then starts at that cell and continues down the column until a RETURN function is invoked. The argument passed to RETURN is what the return value of the function will be. For example, if we wanted to create a subroutine that would eventually return the string “Hello World”, it would look something like this:

A simple example of an Excel 4.0 macro subroutine – it will eventually pop up an alert saying “Hello World”

Excel actually even aliases the RUN command by letting users specify a cell reference or cell name and invoke it directly by appending () to the invocation as seen below:

This is functionally identical to the previous Macro sheet
This is also the same, except B1 has been named MySub

It’s not a very common way to see macros used right now, but malware authors are clearly already aware of this19 as can be seen from a sample shared by @JohnLaTwc and analyzed by @DissectMalware:

Example behavior from a maldoc submitted to VirusTotal in March 201920 (Image from @DissectMalware)

While using subroutines in this way might be slightly helpful for slowing analysis of a document, it’s really only dipping its toes into the potential of “proper” subroutine usage in a maldoc. For example, what if instead of having the byte sequence 41 6F 00 every time we invoked CHAR, we moved the CHAR expression into a subroutine and just invoked the subroutine repeatedly? The predictable function invocation would only appear once, and it would be much harder to claim that EVERY usage of CHAR is malicious. Even Windows Defender’s aggressive blocking of =CHAR(#) invocations requires other conditions beyond matching three bytes. Here’s an example of what replacing the CHAR expression with a subroutine looks like:

We can actually “create” our subroutine at runtime using SET.NAME to specify the subroutine cell and its argument

So this is slightly different from our previous examples, but the main difference is that we are invoking SET.NAME in order to specify two values:

  1. We are defining the value of InvokeCharSub to be equivalent to a reference to cell B1. Later we invoke it using InvokeCharSub(), though we could also use RUN(InvokeCharSub).
  2. We are setting the value of the name “arg” to 65. This is essentially how we pass arguments to our subroutine. While there does appear to be an ARGUMENT function that allows explicitly defining names to store arguments, I haven’t been able to make this work any differently than just manually setting names or cell values. While porting EXCELntDonut macros into Macrome21 I also realized that you can simply write arg=65 in an Excel cell, and it will automatically be interpreted as SET.NAME("arg",65).

What a User Defined Function invocation looks like in byte form

Under the covers when we call InvokeCharSub(), we are having Excel call a user defined function through the PtgFuncVar Parse Thing object. User defined functions are a PtgFuncVar edge case – one of the arguments provided to the PtgFuncVar must be a PtgName22. PtgName objects reference a Lbl23 entry stored within the Excel Workbook’s Globals Substream24. In this case, we are looking for the 3rd Lbl entry in the substream – it’s also worth noting that the index here starts at 1, rather than 0. We’ll come back to some “fun” that malware authors can have with these labels later.

The Lbl list from our test document’s Globals Substream – the 3rd item is InvokeCharSub, our subroutine name

So we have a mechanism to replace our CHAR function invocations with a SET.NAME invocation followed by a call to a user defined function. This turns one very simple cell into two cells, but there’s a workaround for that as well. A final possible optimization to reduce the size of our document is to combine our variable assignment with the invocation of our subroutine by abusing the IF function to execute two expressions in a single cell – for example:

=IF(SET.NAME("var",65),invokeChar(),)

The invocation of SET.NAME here saves us from having to use two cells to invoke our subroutine and lets us use a single cell which cuts down on our FORMULA record count by about half. This is the approach used by the CharSubroutine method in Macrome10.

Going back to @Mattifestation‘s detection engineering approach – let’s think about how we could detect this sort of approach and then analyze it. From a detection standpoint, a massive number of invocations of SET.NAME and PtgFuncVar objects with a user defined function would likely stand out. For example, if we look at the above IF statement at the byte level we get something like:

A single FORMULA record containing the SET.NAME and user defined function invocation

We can create a signature for this by keying on the presence of a PtgFuncVar invocation of SET.NAME (42 02 58 00) with some arbitrary locality to a PtgFuncVar invoking a user defined function (42 ?? FF 00 – the Ftab value is FF 00, but we need a wildcard since we can’t necessarily guess the argument count). Our signature doesn’t need to care if SET.NAME comes before or after the user defined function, we just want to check for a large number of these instances. A Yara25 signature for this could look like:

rule msxls_set_name_and_invoke_udf
{
  meta:
    description = "Finding XLS2003 documents with a suspicious number of SET.NAME and User Defined Function invocations"
    Author = "Michael Weber (@BouncyHat)"
  strings:
    $ole_marker = {D0 CF 11 E0 A1 B1 1A E1}
    $setname_invokeudf = {42 02 58 00 [0-100] 42 ?? FF 00}
    $invokeudf_setname = {42 ?? FF 00 [0-100] 42 02 58 00}
  condition:
    $ole_marker at 0 and (#setname_invokeudf > 100 or #invokeudf_setname > 100)
}

Note that the wildcard range [0-100] probably makes this computationally expensive to run on a large dataset, but the upper bound of 100 wildcard bytes could be lowered as needed.
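For anyone who wants to experiment with this logic outside of YARA, the rule can be roughly approximated with Python's `re` module. This is a sketch only: `finditer` returns non-overlapping matches, so the counts will not always agree with YARA's `#` operator, and the function name and default threshold are my own placeholders.

```python
import re

OLE_MARKER = bytes.fromhex("D0CF11E0A1B11AE1")
SET_NAME = re.escape(bytes.fromhex("42025800"))  # PtgFuncVar(2, SET.NAME)
UDF_CALL = rb"\x42.\xff\x00"  # PtgFuncVar, any argument count, Ftab FF 00

# Lazy gap of up to 100 arbitrary bytes, mirroring the [0-100] jump in the rule.
SETNAME_THEN_UDF = re.compile(SET_NAME + rb".{0,100}?" + UDF_CALL, re.DOTALL)
UDF_THEN_SETNAME = re.compile(UDF_CALL + rb".{0,100}?" + SET_NAME, re.DOTALL)


def matches_rule(data: bytes, threshold: int = 100) -> bool:
    """Approximate the rule: OLE header plus many SET.NAME / UDF pairs."""
    if not data.startswith(OLE_MARKER):
        return False
    forward = sum(1 for _ in SETNAME_THEN_UDF.finditer(data))
    backward = sum(1 for _ in UDF_THEN_SETNAME.finditer(data))
    return forward > threshold or backward > threshold
```

Lowering the `{0,100}` bound in both patterns has the same effect as tightening the jump in the YARA rule.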

This signature could still be avoided (as is true for most signatures) with a little additional effort on the part of the attacker. As demoed in Outflank’s research26, we can use Excel’s WHILE functionality to iterate over a column of seemingly harmless numbers and use them to build strings of binary data or additional macro statements to populate with the FORMULA function.

Here we have a Macro, starting at B1, that replaces our numerous CHAR() invocations with a subroutine at A1

But let’s assume that there is a foolproof signature to identify our document and that our document has made its way into the hands of an analyst armed with a tool like XLMMacroDeobfuscator6 or olevba27. Are there any weird behaviors that can be abused to trick analysts attempting to examine our document? Thanks to Excel’s “flexibility” with Lbl records, the answer is yes.

(Ab)Using Names in Excel 4.0 Macros

The usage of Lbl record lookups when resolving names is another opportunity for malware authors to frustrate analysis. In my previous blog post1 I discussed how Excel’s flexible handling of the Auto_Open Lbl record made signature creation extremely challenging. It seems like similar issues would apply to “variable” and subroutine name invocation as well. For example – what would you expect the output of the following macro sheet to be?

Assuming case sensitivity were used, the string “arg” should be displayed
But Excel Lbl records are much more flexible than that

This looks like a nice trick, but it doesn’t appear to do much to frustrate analysis – at a glance. Just HOW flexible is Excel’s interchangeability with upper case and lower case letters?

What happens if we go into the Unicode character sets?
Obviously the lower case Zeta symbol (ζ) was going to overwrite that capital Zeta (Ζ)

It’s pretty flexible. There are a surprising number of multi-case characters to confuse Excel; just take a glance at the library of valid lower case Unicode characters28. Unfortunately for defenders, the PtgStr record29 used by Excel to invoke SET.NAME will happily allow attackers to set arbitrary Unicode content for arguments, so this is a challenging situation to avoid. The issues don’t stop at casing confusion either – Excel also respects Unicode Equivalence30. This behavior, which is part of the Unicode specification31, is a consistent32 source of pain33 in the security world34.

One example of how Unicode Equivalence can frustrate analysis is Decomposed Unicode. Decomposed Unicode values are alternate representations of Unicode characters that use a series of characters instead of a single Unicode character. For example – consider the Unicode character ḁ (U+1E01)35. This can be represented as 2 bytes in UTF-16 (Excel’s Unicode interpretation) as 1E 01. Alternatively, we can represent it as the letter a and the ◌̥ combining diacritical mark36 – or 00 61 03 25. (Note: These diacritical marks are the same bit of fun that can be used to create Zalgo monstrosities37)
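The composed and decomposed forms of U+1E01 can be checked with Python's standard `unicodedata` module. This small sketch prints each UTF-16 code unit high byte first so the values line up with the ones quoted above; the helper name `utf16_hex` is just something I made up for the example.

```python
import unicodedata

composed = "\u1e01"  # LATIN SMALL LETTER A WITH RING BELOW
decomposed = unicodedata.normalize("NFD", composed)  # "a" + U+0325


def utf16_hex(text: str) -> str:
    """Hex of the UTF-16 code units, high byte first."""
    return text.encode("utf-16-be").hex(" ").upper()


print(utf16_hex(composed))
print(utf16_hex(decomposed))
# Normalizing back to NFC recombines the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Two different byte sequences, one logical character: a byte-level signature for one form will simply not see the other.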

There also exist Unicode characters, like the Combining Grapheme Joiner38 (03 4F) which are essentially no-op characters for most Unicode strings. The Wikipedia article for the character explicitly describes it as “default ignorable” in the first sentence:

“The combining grapheme joiner (CGJ), U+034F ͏ COMBINING GRAPHEME JOINER (HTML ͏) is a Unicode character that has no visible glyph and is “default ignorable” by applications.”

https://en.wikipedia.org/wiki/Combining_Grapheme_Joiner

Finally, there are a sizable number of Unicode whitespace characters39 which can change the byte contents of a string without changing its appearance. The “most interesting” of these whitespace characters are the zero-width Unicode characters. A zero-width character makes no visible change to the label. Some of these characters are ignored by Excel when comparing strings (U+200C, U+200D, U+2060, and U+FFEF), but others (U+180E and U+200B) are not. These characters can be used to pad variable names, or create decoy names that look the same but are not actually assigned when invoking SET.NAME.
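Pulling the last few paragraphs together, an analyst's tool would need something like the comparison below to guess whether two names refer to the same Lbl. To be clear, this is my rough approximation of the behavior described in this post (case folding, canonical composition, and a small set of ignored code points), not Excel's actual algorithm, and the names `IGNORED`, `canonical_name`, and `names_match` are hypothetical.

```python
import unicodedata

# Code points this post reports as ignored in name comparisons, as listed above,
# plus the Combining Grapheme Joiner. U+180E and U+200B are deliberately absent.
IGNORED = {"\u034f", "\u200c", "\u200d", "\u2060", "\uffef"}


def canonical_name(name: str) -> str:
    """Drop ignored code points, compose, then fold case."""
    stripped = "".join(ch for ch in name if ch not in IGNORED)
    return unicodedata.normalize("NFC", stripped).casefold()


def names_match(a: str, b: str) -> bool:
    return canonical_name(a) == canonical_name(b)


# A decomposed, upper-case, padded spelling still matches U+1E01...
print(names_match("\u1e01", "\u2060A\u0325\u034f"))
# ...but inserting U+200B breaks the match while looking identical on screen.
print(names_match("\u1e01", "\u2060A\u200b\u0325\u034f"))
```

Every edge case Excel handles differently from this sketch is another place where a static analysis tool and the real application can disagree.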

There’s nothing fundamentally bad about following the Unicode specification, but combining support for Unicode equivalence with some of Excel’s other flexibility can lead to very counter-intuitive equivalencies. For example, 1E 01 (ḁ) is considered the same as 20 60 00 41 03 25 03 4F 00 (a decomposed with some ignored Unicode characters added to the string). Replacing some of those bytes with a 18 0E or 20 0B would break the equivalency as well, which allows us to create strings that look identical, but are not treated as such by Excel. In practice this lets us create, using Macrome’s10 AntiAnalysisCharSubroutine method, the following content:

It is random whether the first SET.NAME or second SET.NAME in each cell sets the value passed to the subroutine

Although the vḁr strings appear to be identical, they are in fact quite different on disk. This means that any analysis of the cell to figure out what will actually happen will require running Excel or manually reproducing Excel’s EXACT handling of Unicode characters. Reproducing the behavior is going to require handling a lot of edge cases. If you want a sense of what analysts could be up against, here’s what the above example looks like in binary:

Note that both SET.NAME arguments are very different from the Lbl name used in =RETURN(CHAR(‘vḀr’))

In the above example the “Real” argument bytes are considered a match for the Lbl name bytes, but the “Decoy” argument bytes are not. The fact that Lbl record strings can be so wildly different from the PtgStr arguments passed to SET.NAME makes it challenging to follow Excel’s data flow without actually running Excel. Even then, Excel isn’t consistent with handling Unicode values – see what happens when null bytes are injected into the Auto_Open label after the u character:

The Name Manager sees Au, but the cell label is AuTo_OpEn

Given the already low detection rate for Excel 4.0 macros in the wild, we may never see attackers need to rely on this level of trickery. If AV does start getting better signal with their signatures though, I will not be surprised to see various forms of Unicode abuse begin to crop up.

Updates to Macrome

In the process of digging deeper into Excel documents, I’ve often come across a need to examine the byte content of specific records as a hex dump. While I don’t mind crawling through a wall of hex text, I’ve managed to save some time by modifying my tool Macrome to dump the hex content of Lbl and Formula records. All of the hex examples from this post were generated using this dump functionality. I’ve also implemented code for generating proof-of-concept documents using some of the subroutine and Unicode shenanigans that I discussed in this post. If you want to try generating some malicious documents to see how your tooling will handle these kinds of documents I’d suggest heading over to https://github.com/michaelweber/Macrome and grabbing the latest release.

As always, if folks have any suggestions for features or improvements, please let me know here in the comments or open an issue on the GitHub project page.

References

  1. https://malware.pizza/2020/05/12/evading-av-with-excel-macros-and-biff8-xls/
  2. https://www.lastline.com/labsblog/evolution-of-excel-4-0-macro-weaponization/
  3. https://inquest.net/flash-alerts/IQ-FA004%3AMultiple_Actors_Abusing_New_Macro_Methods
  4. https://twitter.com/InQuest/status/1268568312499376130
  5. https://twitter.com/DissectMalware/status/1268491222299086854
  6. https://github.com/DissectMalware/XLMMacroDeobfuscator
  7. https://twitter.com/Anti_Expl0it/status/1269895583633829888
  8. https://github.com/FortyNorthSecurity/EXCELntDonut/
  9. https://github.com/TheWover/donut
  10. https://github.com/michaelweber/Macrome
  11. https://www.virustotal.com/gui/file/b159b25b80b1830acf40813c06a48f3e72666720b7efcd406ea5031c7f214c31/detection
  12. https://twitter.com/mattifestation/status/1263416936517468167
  13. https://pastebin.com/V8SGgdZL
  14. https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/87ce512d-273a-4da0-a9f8-26cf1d93508d
  15. https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/00b5dd7d-51ca-4938-b7b7-483fe0e5933b
  16. https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/5d105171-6b73-4f40-a7cd-6bf2aae15e83
  17. https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/8e3c6978-6c9f-4915-a826-07613204b244
  18. https://exceloffthegrid.com/using-excel-4-macro-functions/
  19. https://twitter.com/DissectMalware/status/1269535826813366273
  20. https://www.virustotal.com/gui/file/a53be0bd2a838ffe172181f3953a2bc8a1b7c447fb56d885391921a7c3eac1f9/details
  21. https://github.com/michaelweber/Macrome/releases/tag/0.2.0
  22. https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/5f05c166-dfe3-4bbf-85aa-31c09c0258c0
  23. https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/d148e898-4504-4841-a793-ee85f3ea9eef
  24. https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/ca4c1748-8729-4a93-abb9-4602b3a01fb1
  25. https://virustotal.github.io/yara/
  26. https://outflank.nl/blog/2018/10/06/old-school-evil-excel-4-0-macros-xlm/
  27. https://github.com/decalage2/oletools/wiki/olevba
  28. https://www.compart.com/en/unicode/category/Ll
  29. https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/87c2a057-705c-4473-a168-6d5fac4a9eba
  30. https://en.wikipedia.org/wiki/Unicode_equivalence
  31. https://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf
  32. https://www.dionach.com/en-us/blog/fun-with-sql-injection-using-unicode-smuggling/
  33. https://hackernoon.com/%CA%BC-%C5%9B%E2%84%87%E2%84%92%E2%84%87%E2%84%82%CA%88-how-unicode-homoglyphs-will-break-your-custom-sql-injection-sanitizing-functions-1224377f7b51
  34. https://book.hacktricks.xyz/pentesting-web/unicode-normalization-vulnerability
  35. https://www.compart.com/en/unicode/U+1E01
  36. https://www.compart.com/en/unicode/U+0325
  37. https://zalgo.it/en/
  38. https://en.wikipedia.org/wiki/Combining_Grapheme_Joiner
  39. https://en.wikipedia.org/wiki/Whitespace_character

Evading Detection with Excel 4.0 Macros and the BIFF8 XLS Format

Abusing legacy functionality built into the Microsoft Office suite is a tale as old as time. One functionality that is popular with red teamers and maldoc authors is using Excel 4.0 Macros to embed standard malicious behavior in Excel files and then execute phishing campaigns with these documents. These macros, which are fully documented online, can make web requests, execute shell commands, access win32 APIs, and have many other capabilities which are desirable to malware authors. As an added bonus, the Excel format embeds macros within Macro sheets which can be more challenging to examine statically than VBA macros which are easier to extract. As a result, many malicious macro documents have a much lower than expected rate of detection in the AV world.

Malware campaigns, such as the ZLoader campaign (described in great detail by InQuest Labs here, here, and here) are actively abusing this functionality to perform mass phishing attacks. The campaign is so prolific that I’ve actually received one of these maldocs in one of my personal email accounts. Because of its effectiveness and low detection rate, this technique is also popular in the penetration testing community. Outflank described how to embed shellcode in Excel 4.0 Macros in 2018, and tooling has been published to abuse this functionality via Excel’s ExecuteExcel4Macro VBA API.

While there is clearly already a spotlight on the subject of Excel 4.0 Macros, I believe that only the surface of this attack vector has been scratched. There’s no doubt that defenders are building better signal on malicious macros (one of the tools which originally had 0 detections on VirusTotal is now up to 15 at the time of writing this post), but there is also evidence that some of this signal can be brittle and unreliable.

For example, the ZLoader campaign obfuscates its macros using a series of cells that build each command from CHAR expressions. Ex: =CHAR(61) evaluates to the = character.

A ZLoader Campaign’s Macro Sheet (image from @DissectMalware)
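Undoing this particular obfuscation is straightforward once the cell formulas have been extracted. The sketch below assumes a list of plain `=CHAR(n)` formula strings with codes in the ASCII range; the function name and input format are hypothetical and real samples mix in other expressions.

```python
import re

CHAR_CELL = re.compile(r"^=CHAR\((\d+)\)$", re.IGNORECASE)


def decode_char_cells(cells):
    """Concatenate the characters produced by a column of =CHAR(n) cells."""
    out = []
    for cell in cells:
        match = CHAR_CELL.match(cell.strip())
        if match is None:
            raise ValueError(f"not a plain CHAR cell: {cell!r}")
        out.append(chr(int(match.group(1))))
    return "".join(out)


column = ["=CHAR(61)", "=CHAR(72)", "=CHAR(65)", "=CHAR(76)",
          "=CHAR(84)", "=CHAR(40)", "=CHAR(41)"]
print(decode_char_cells(column))
```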

There’s plenty to build a signature on in this sheet:

  • The repeated usage of the =CHAR(#) cells to define formula content one character at a time.
  • The use of the Auto_Open label which triggers automatic execution of the macro sheet once the “Enable Content” button is pressed.
  • ZLoader marks their macro sheets as hidden, which has a detectable static signature.
  • The use of numerous Formula expressions to dynamically generate additional expressions at runtime.

A lot of this would appear to be good enough signal to just block outright – Windows Defender, for example, considers just about any usage of =CHAR(#) to be malicious. Making an empty macro sheet that contains one cell with =CHAR(42) and another with =HALT() will immediately flag the document as malicious:

If you try to save this document with Windows Defender enabled, it will block the save operation

This is probably a bit overkill, but apparently the number of legitimate users that do this is small enough that Windows can roll out a patch to all machines marking it malicious. A more reasonable signature, which seems resistant to false positives, is @DissectMalware’s macro_sheet_obfuscated_char YARA rule:

rule macro_sheet_obfuscated_char
{
  meta:
    description = "Finding hidden/very-hidden macros with many CHAR functions"
    Author = "DissectMalware"
    Sample = "0e9ec7a974b87f4c16c842e648dd212f80349eecb4e636087770bc1748206c3b (Zloader)"
  strings:
    $ole_marker = {D0 CF 11 E0 A1 B1 1A E1}              
    $macro_sheet_h1 = {85 00 ?? ?? ?? ?? ?? ?? 01 01}
    $macro_sheet_h2 = {85 00 ?? ?? ?? ?? ?? ?? 02 01}    
    $char_func = {06 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 1E 3D  00 41 6F 00}
  condition:
    $ole_marker at 0 and 1 of ($macro_sheet_h*) and #char_func > 10
}

This rule looks for three things:

  1. The standard magic header for Office documents D0CF11E0A1B11AE1 at the start of the file.
  2. A macro sheet (defined in a BoundSheet8 BIFF Record) with a hidden state set to Hidden or VeryHidden.
  3. The presence of more than 10 Formula BIFF Records which have an Rgce field containing two Ptg structures – a PtgInt representing the value 0x3D (which maps to the = character) and a PtgFunc with an Ftab value of 0x6F (the matching tab value for the CHAR function).
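The third condition can also be expressed as a structured check rather than a byte regex. The sketch below assumes the Formula record body layout from the MS-XLS documentation as I understand it (a 6-byte cell header, an 8-byte cached value, 2 bytes of flags, a 4-byte chn field, then a 2-byte cce length followed by rgce); the function names are hypothetical. The run of wildcards in $char_func is skipping over those same fixed-size fields.

```python
import struct

# rgce bytes the rule keys on: PtgInt(0x3D) followed by PtgFunc(CHAR).
CHAR_EQUALS_RGCE = bytes.fromhex("1E3D00416F00")


def formula_rgce(body: bytes) -> bytes:
    """Return the rgce bytes from a Formula record body (4-byte header removed).

    Assumed layout: row, column, ixfe (6 bytes), cached value (8 bytes),
    flags (2 bytes), chn (4 bytes), cce (2 bytes), then rgce.
    """
    (cce,) = struct.unpack_from("<H", body, 20)
    return body[22:22 + cce]


def is_char_equals_formula(body: bytes) -> bool:
    """True when the formula starts with PtgInt(0x3D), PtgFunc(CHAR)."""
    return formula_rgce(body).startswith(CHAR_EQUALS_RGCE)
```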

Unless you are fairly acquainted with the Excel 2003 Binary format (also known as BIFF8), the third search condition is likely to read as a series of random letters jammed together rather than anything coherent. To better understand what exactly is being discussed, let’s take a quick detour into the BIFF8 file format.

The Excel 97-2003 Binary File Format (BIFF8)

Before Office documents were saved in the Office Open XML (OOXML) format, they were saved in a much more succinct binary format focused on describing the maximum amount of information with the minimum number of bytes. Legacy Office documents are stored in the Compound File Binary (CFB) format, while their actual application-specific data (such as Word document content or Excel workbook information) is stored within binary streams embedded in the CFB file.

Excel’s workbook stream is a direct series of Binary Interchange File Format (BIFF) records. The records are fairly simple – there are 2 bytes for describing the record type, 2 bytes for describing the remaining length of the record, and then the relevant record bytes. An Excel workbook is just a series of BIFF records beginning with a BOF record and eventually ending with a final EOF record. Microsoft’s Open Specifications project has helpfully documented every one of these records online. For example, if we are parsing a stream and read a record beginning with the byte sequence 85 00 0E 00, we are reading a BoundSheet8 record that is 14 bytes long.
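Walking the record stream takes only a few lines of code. This is a minimal sketch that assumes the workbook stream has already been extracted from the container file; the function name is hypothetical and it makes no attempt to stitch Continue records back together.

```python
import struct
from typing import Iterator, Tuple


def iter_biff_records(stream: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (record_type, body) pairs from a raw workbook stream."""
    offset = 0
    while offset + 4 <= len(stream):
        # Both header fields are little-endian 16-bit integers.
        record_type, length = struct.unpack_from("<HH", stream, offset)
        body = stream[offset + 4:offset + 4 + length]
        if len(body) < length:
            break  # truncated record; stop rather than guess
        yield record_type, body
        offset += 4 + length


# 85 00 0E 00 -> type 0x0085 (BoundSheet8) with a 14 byte body.
demo = bytes.fromhex("85000E00") + bytes(14) + bytes.fromhex("0A000000")
for record_type, body in iter_biff_records(demo):
    print(hex(record_type), len(body))
```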

From Microsoft’s documentation we can see that BoundSheet8 records contain a 4 byte offset pointing to the relevant BOF record, 2 bits used for describing the visible state of the sheet, a single byte used for describing the sheet type, and a variable number of bytes used for the name of the sheet.

Hex dump of a VeryHidden Macro sheet’s BoundSheet8 BIFF record

The above hex dump represents a BoundSheet8 record for a Macro sheet that has been “Very Hidden” – essentially made inaccessible from within Excel’s UI. This record would match the YARA sig byte regex of $macro_sheet_h2 = {85 00 ?? ?? ?? ?? ?? ?? 02 01}. The signature begins with the matching BIFF record id for BoundSheet8 (85 00), then ignores the size (2 bytes) and the lbPlyPos record (4 bytes). It then matches the hsState field (02) followed by the byte indicating that the sheet is a macro sheet (01). This is a reasonable match for sheets that follow the BIFF8 specification.

Fiddling with BIFF Records

However, there are a few tricks to essentially dodge this signature component by abusing flexibility in the specification. For example, the hsState field is only supposed to be represented by 2 bits – the remaining 6 bits of that byte are reserved. Theoretically this means that touching these bits should invalidate a spreadsheet, but this is not what happens in practice. Say we replaced the value 02 (00000010 in binary) with a value like AA (10101010 in binary), which flips several reserved bits while leaving the low two bits unchanged – would Excel also treat that as a hidden sheet? I can’t speak for all versions of Excel, but in my testing with Excel 2010 and 2019, the answer is yes.
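A parser that wants to agree with that observed behavior should mask off the reserved bits instead of comparing the whole byte. Here is a minimal sketch of reading the fixed part of a BoundSheet8 body that way; the dictionary keys and function name are my own, and the field layout follows the description above.

```python
import struct

HIDDEN_STATES = {0: "Visible", 1: "Hidden", 2: "VeryHidden"}
MACRO_SHEET = 0x01


def parse_boundsheet8(body: bytes) -> dict:
    """Decode lbPlyPos, hsState and the sheet type from a BoundSheet8 body."""
    lb_ply_pos, state_byte, sheet_type = struct.unpack_from("<IBB", body, 0)
    hs_state = state_byte & 0x03  # only the low two bits carry hsState
    return {
        "lbPlyPos": lb_ply_pos,
        "hsState": HIDDEN_STATES.get(hs_state, "Unknown"),
        "reserved_bits": state_byte >> 2,
        "is_macro_sheet": sheet_type == MACRO_SHEET,
    }


normal = parse_boundsheet8(bytes.fromhex("000000000201"))
tampered = parse_boundsheet8(bytes.fromhex("00000000AA01"))
print(normal["hsState"], tampered["hsState"], hex(tampered["reserved_bits"]))
```

A signature equivalent would wildcard the hsState byte or test it with a bit mask rather than matching the literal 01 or 02.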

Essentially, following the majority of the specification, but not the exact way that Excel has traditionally generated these documents, creates an entirely new set of Excel binary sheets that bypass most static signatures. The remainder of this blog post will focus on a few examples of abusing the BIFF8 specification to create alternate, but valid, Excel documents.

Label (Lbl) Records

Lbl records are used for explicitly naming cells in a worksheet for reference by other formulas. In some cases, Lbl records can contain macros or trigger the download and execution of other macros. From a malicious macro author’s perspective, though, the most likely usage of a Lbl record is to define the Auto_Open cell for their workbook. If a workbook has an explicitly defined Auto_Open cell then, once macros are enabled, Excel will immediately begin evaluating the macros defined at that cell and continue evaluating cells below it until a HALT() function is invoked. Understandably, the existence of an Auto_Open Lbl record is considered fairly suspicious, so there are a number of workarounds attackers have taken to hide their usage of this functionality. Let’s see if there are some other evasion techniques hiding in the Lbl record specification:

The Lbl record is a big structure with plenty of room for abuse

By default, when an Auto_Open label is defined in a BIFF8 document, it has its fBuiltin flag set to true, and its name field set to the value 01, indicating that this is an Auto_Open function. The first 17 bytes of this record (21 if you include the 4 byte header) can likely be used as a signature to identify usage. This does assume a lack of meddling with the reserved bytes which default to 00, but signatures could probably replace these with wildcard bytes and not pick up too many false positives. Given that normal labels are never going to have a single byte value of 01, there is a very small chance of triggering false positives with this as well.

A default Lbl entry for Auto_Open

If a user attempts to save any variation on the Auto_Open label (like alternative capitalization AuTo_OpEn), Excel will automatically convert it back to the shortened fBuiltin version shown above. However, when Excel opens an OOXML formatted workbook there is no equivalent shorthand record for Auto_Open, it is simply stored as a string. So what happens if we explicitly create a Lbl record, leave fBuiltin as false, and give it a name of Auto_Open?

A Lbl record with fBuiltin flipped to false, and the Name field set to Auto_Open

If a Lbl record is generated with these properties and inserted into an Excel document, Excel will still treat the referenced cell as an Auto_Open cell and trigger it. So we can create a label that triggers Auto_Open behavior but doesn’t look like the default record. This is a good start, but once a technique like this became well known it would also be vulnerable to a quick signature. As is, there are already plenty of AV solutions that will explicitly look for the Auto_Open string since attackers have been abusing this in OOXML documents in the wild.

An example of an OOXML document abusing Excel’s flexible Auto_Open parsing

Excel is surprisingly flexible when it comes to considering a text field matching the Auto_Open label – apparently the application only checks if the label starts with the string Auto_Open. This results in maldocs with labels like Auto_Open21. In fact, if you use Excel to save a label with name like Auto_Open222, it will actually save the record using a combination of the fBuiltin flag, and then append the extra characters, as can be seen below.

How Excel saves the label Auto_Open222 – note it maintains the fBuiltin flag (20) and doesn’t include the Auto_Open text, just the 0x01 indicating Auto_Open

Appending characters is great, but can we inject additional characters into the Auto_Open string in a way that Excel will still read it? A common trick in bypassing input validation is to try injecting null bytes to see if it results in the string being terminated early. Occasionally null bytes are also good for changing the length of a string without affecting its value.

The Auto_Open label with null bytes injected
How Excel’s Name Manager renders this Lbl record

Excel will actually give us the best of both worlds, from an attacker perspective, when injecting null bytes. The Auto_Open functionality will remain intact and still trigger for the cell we specify, but the Name Manager will not properly display any part of the name after the first null byte. Additionally, our Lbl record’s name data will not be easily match-able with a predictable regex.
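From the defender's side, the behavior described so far (case-insensitive matching, prefix matching, and tolerance of null bytes) can be approximated with a small normalization step before matching. The sketch below is only a model of what the testing above suggests, not Excel's actual parsing logic. The function name `looks_like_auto_open` and the null placement in the sample are hypothetical.

```python
import re


def looks_like_auto_open(name: str) -> bool:
    """Approximate the label matching described in the text.

    Assumptions: null characters are ignored, case is ignored, and only
    the prefix has to match.
    """
    cleaned = name.replace("\x00", "")
    return cleaned.casefold().startswith("auto_open")


# A naive contiguous-string signature misses a null-injected name...
NULL_INJECTED = b"Au\x00to_\x00Open"
naive_hit = re.search(rb"Auto_Open", NULL_INJECTED) is not None

# ...while the normalized comparison still recognizes it.
normalized_hit = looks_like_auto_open(NULL_INJECTED.decode("latin-1"))
```

The same helper also accepts variants like AuTo_OpEn and Auto_Open222, which is the point: matching on the raw bytes is fragile, while matching on a normalized form is not.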

The rabbit hole goes deeper than just null byte injection, however. The Name field in Lbl records is represented by an XLUnicodeStringNoCch record. This record allows us to specify strings using either (essentially) ASCII or UTF-16, depending on whether we set the fHighByte flag. Besides further breaking any signatures relying on a contiguous Auto_Open string, the usage of UTF-16 opens a whole new world of string abuse to attackers.

Unicode is traditionally a parsing nightmare in the security space due to inconsistent handling of edge cases across implementations. Excel is no exception: when the label parsing code encounters an unexpected character, it appears to simply ignore it. In my testing, every “invalid” Unicode character I tried was skipped entirely. There are likely exceptions to the rule, but it appears that any entry that fileformat.info lists as an invalid combination can be injected into XLUnicodeStringNoCch records without impacting parsing. For example, if we build a string like "\ufefeA\uffffu\ufefft\ufffeo\uffef_\ufff0O\ufff1p\ufff6e\ufefdn\udddd", this will still trigger the Excel Auto_Open functionality.

After some fun with Unicode this looks VERY different from our initial Lbl record
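To see why this defeats contiguous-string signatures, the sketch below encodes the example string the way I understand an XLUnicodeStringNoCch with fHighByte set to be laid out (one flag byte followed by little-endian 16-bit code units), then shows a heuristic normalization a defender might apply. The set of characters Excel actually skips is not known precisely, so keeping only ASCII letters, digits, and underscores is an assumption for illustration. Both function names are hypothetical.

```python
import struct

# The example name from the text, with filler code points between letters.
PADDED_NAME = (
    "\ufefeA\uffffu\ufefft\ufffeo\uffef_\ufff0O\ufff1p\ufff6e\ufefdn\udddd"
)


def encode_no_cch_utf16(name: str) -> bytes:
    """Encode as fHighByte=1 flag byte + UTF-16LE code units.

    Code units are packed one by one so that lone surrogates such as
    U+DDDD survive, which str.encode("utf-16-le") would reject.
    """
    units = [ord(ch) for ch in name]
    if any(unit > 0xFFFF for unit in units):
        raise ValueError("only single 16-bit code units are handled here")
    return b"\x01" + struct.pack(f"<{len(units)}H", *units)


def normalize_label(raw: bytes) -> str:
    """Decode a flagged name and keep only ASCII letters, digits, and '_'.

    This filter is a heuristic assumption, not Excel's real rule.
    """
    if raw[0] & 0x01:
        count = (len(raw) - 1) // 2
        units = struct.unpack_from(f"<{count}H", raw, 1)
    else:
        units = tuple(raw[1:])
    keep = lambda u: u < 0x80 and (chr(u).isalnum() or chr(u) == "_")
    return "".join(chr(u) for u in units if keep(u))


encoded = encode_no_cch_utf16(PADDED_NAME)
```

Neither the ASCII bytes of Auto_Open nor their plain UTF-16LE encoding appear in `encoded`, yet `normalize_label(encoded)` recovers the name, so a signature would need to match on the normalized form rather than the raw bytes.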

This could be combined with null byte injection to hide the manipulation from the Name Manager UI, or the Lbl record’s fHidden bit could be set to stop the label from appearing in the Name Manager at all. The ability to inject an arbitrary amount of garbage between the letters of the Lbl name significantly increases the difficulty of building a reliable signature for this technique.

The Rgce and Ptg Structures

Let’s revisit the YARA rule from earlier, specifically the part for detecting usages of =CHAR(#):

$char_func = {06 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 1E 3D  00 41 6F 00}

This signature is keying on the beginning of a Formula record, and then the CellParsedFormula structure towards the end. CellParsedFormula structures contain three things:

  1. cce – The size of the following rgce structure
  2. rgce – The actual structure containing what we’d consider to contain the formula
  3. rgcb – A secondary structure containing supporting information that might be referenced in rgce
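
As a rough illustration of that layout, the Python sketch below splits a CellParsedFormula blob into its three parts. It assumes cce is a 2-byte little-endian length, which is my reading of the MS-XLS specification. The function name `split_cell_parsed_formula` is hypothetical.

```python
import struct
from typing import Tuple


def split_cell_parsed_formula(data: bytes) -> Tuple[int, bytes, bytes]:
    """Split a CellParsedFormula into (cce, rgce, rgcb).

    Assumes cce is an unsigned 16-bit little-endian byte count for rgce,
    and that whatever follows rgce is rgcb.
    """
    (cce,) = struct.unpack_from("<H", data, 0)
    rgce = data[2 : 2 + cce]
    if len(rgce) != cce:
        raise ValueError("cce claims more rgce bytes than are present")
    rgcb = data[2 + cce :]
    return cce, rgce, rgcb


# The =CHAR(61) example: cce = 6, followed by the six rgce bytes.
cce, rgce, rgcb = split_cell_parsed_formula(bytes.fromhex("0600 1e3d00 416f00"))
```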

So what on earth is an Rgce structure? Why, it’s a set of Ptg structures of course! Ptg structures, short for “Parse Thing”, are the base component of Formulas. While one might expect to find a string representation of a formula like =CHAR(61), this wouldn’t mesh with BIFF8’s hyper-focus on reducing file size. Each formula is represented as a series of Ptg expressions, each of which describes a small piece of what a user would consider to be a formula. For example, =CHAR(61) is in fact two components: a reference to the internal CHAR function, and the number 61. Each of these components has a corresponding Ptg structure.

The CHAR function is represented by a PtgFunc, a Ptg record which contains a reference to a predefined list of functions in Excel known as the Ftab.

The Ftab value table specifying that 0x6F is the CHAR function

The number 61 is represented by a PtgInt structure which is just the standard Ptg header and an integer with the value of 61:

Many Ptg records, like the PtgInt, are fairly straightforward

As a result, we end up with the binary signature of 1E 3D 00 41 6F 00 (41 is the Ptg number for PtgFunc). One thing that might stand out here, however, is the fact that the ordering of this data seems backwards – the PtgInt(61) data is stored before the PtgFunc(CHAR) data.
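A tiny tokenizer makes the byte layout easier to see. The sketch below handles only the two Ptg types discussed so far (PtgInt as 0x1E plus a 16-bit integer, PtgFunc as 0x41 plus a 16-bit Ftab index) and raises on anything else. The names `tokenize_rgce` and `FTAB_NAMES` are hypothetical, and the Ftab table here contains just the one entry shown above.

```python
import struct
from typing import List, Tuple, Union

PTG_INT = 0x1E          # PtgInt: followed by an unsigned 16-bit integer
PTG_FUNC_VALUE = 0x41   # PtgFunc as it appears in the signature
FTAB_NAMES = {0x006F: "CHAR"}   # deliberately partial Ftab lookup

Token = Tuple[str, Union[int, str]]


def tokenize_rgce(rgce: bytes) -> List[Token]:
    """Walk an rgce byte string and return the Ptg tokens in stored order."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(rgce):
        ptg = rgce[pos]
        if ptg == PTG_INT:
            (value,) = struct.unpack_from("<H", rgce, pos + 1)
            tokens.append(("PtgInt", value))
            pos += 3
        elif ptg == PTG_FUNC_VALUE:
            (iftab,) = struct.unpack_from("<H", rgce, pos + 1)
            tokens.append(("PtgFunc", FTAB_NAMES.get(iftab, hex(iftab))))
            pos += 3
        else:
            raise ValueError(f"unsupported Ptg 0x{ptg:02X} at offset {pos}")
    return tokens


tokens = tokenize_rgce(bytes.fromhex("1e3d00416f00"))
```

Here `tokens` comes back as `[("PtgInt", 61), ("PtgFunc", "CHAR")]`, with the operand stored ahead of the function.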

This is because Ptg expressions are described using Reverse Polish Notation (RPN). RPN allows for quick parsing of a series of operators and operands without needing to worry about parentheses, because items are processed in the order they are read. For example: 3 4 − 5 + represents taking the values 3 and 4, then applying the subtraction function to those values to get -1. The value 5 is then taken and the addition function is applied to -1 and 5, resulting in 4. This mentality is useful for stack-based programming languages, and it is used here to simulate what is essentially a stack of Ptg expressions. In our example here, the operand PtgInt(61) is pushed onto the stack first, then PtgFunc(CHAR) pops it and is applied to it.
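For readers who haven't worked with RPN before, a minimal stack evaluator in Python shows the mechanics. This is a generic sketch of RPN evaluation, not Excel's formula engine. The `OPERATIONS` table and the `eval_rpn` name are hypothetical, and CHAR is modeled with Python's `chr`.

```python
from typing import Callable, Dict, List, Tuple, Union

Value = Union[float, str]

# Each entry maps a token to (number of operands, implementation).
OPERATIONS: Dict[str, Tuple[int, Callable[..., Value]]] = {
    "+": (2, lambda a, b: a + b),
    "-": (2, lambda a, b: a - b),
    "CHAR": (1, lambda code: chr(int(code))),
}


def eval_rpn(tokens: List[str]) -> Value:
    """Evaluate RPN tokens: operands are pushed, operators pop their arguments."""
    stack: List[Value] = []
    for token in tokens:
        if token in OPERATIONS:
            arity, func = OPERATIONS[token]
            if len(stack) < arity:
                raise ValueError(f"not enough operands for {token}")
            args = stack[-arity:]      # operands in the order they were pushed
            del stack[-arity:]
            stack.append(func(*args))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("expression did not reduce to a single value")
    return stack[0]


arithmetic = eval_rpn("3 4 - 5 +".split())   # (3 - 4) + 5
character = eval_rpn(["61", "CHAR"])         # same shape as PtgInt(61), PtgFunc(CHAR)
```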

This is relevant because the RPN stack-based format of Ptg structures allows us to easily create some very obfuscated expressions without needing to worry about their binary representation. For example, Microsoft Defender blocks all =CHAR(#) expressions, but what if we write a formula like =CHAR(ROUND(61.0,0))? This formula is essentially the same, but ends up being represented very differently at the byte level:

The bytes of our new Formula’s rgce

The rgce listed here is now PtgNum(61.0), PtgInt(0), PtgFunc(ROUND), PtgFunc(CHAR). As an added “bonus”, PtgNum represents its data as a double, so the value of 61 is represented as 00 00 00 00 00 80 4E 40. Embedding a function has also completely changed the order of our Ptg structures such that the bytes of PtgFunc(CHAR) and PtgNum(61.0) are no longer adjacent. The original signature of 1E 3D 00 41 6F 00 no longer matches this Formula.
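The effect on the signature can be checked with a short Python sketch that assembles both rgce variants by hand. The PtgNum token byte (0x1F) and the Ftab index for ROUND (0x001B) come from my reading of the MS-XLS specification rather than from the text above, so treat them as assumptions. The helper names are hypothetical.

```python
import struct

PTG_INT = 0x1E
PTG_NUM = 0x1F            # assumption: PtgNum token byte per MS-XLS
PTG_FUNC_VALUE = 0x41
FTAB_CHAR = 0x006F
FTAB_ROUND = 0x001B       # assumption: ROUND's Ftab index per MS-XLS


def ptg_int(value: int) -> bytes:
    return struct.pack("<BH", PTG_INT, value)


def ptg_num(value: float) -> bytes:
    # PtgNum stores its operand as an 8-byte IEEE 754 double.
    return struct.pack("<Bd", PTG_NUM, value)


def ptg_func(iftab: int) -> bytes:
    return struct.pack("<BH", PTG_FUNC_VALUE, iftab)


# =CHAR(61)
simple_rgce = ptg_int(61) + ptg_func(FTAB_CHAR)

# =CHAR(ROUND(61.0,0)) in RPN order: 61.0, 0, ROUND, CHAR
nested_rgce = (
    ptg_num(61.0) + ptg_int(0) + ptg_func(FTAB_ROUND) + ptg_func(FTAB_CHAR)
)

# The tail of the YARA rule's $char_func pattern.
SIGNATURE_TAIL = bytes.fromhex("1e3d00416f00")

simple_matches = SIGNATURE_TAIL in simple_rgce
nested_matches = SIGNATURE_TAIL in nested_rgce
```

The first variant contains the signature bytes and the second does not, even though both formulas evaluate to the same character.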

In short, the rgce block is ideally designed from a malware author’s perspective. There are numerous ways to represent the exact same functionality that look completely different from a static analysis perspective. The byte layout of the rgce block is also highly sensitive to change: turning a single value into a function invocation can rearrange the order of all other Ptg bytes within the expression.

Introducing Macrome

Much of the work necessary for testing some of these methods involved manually writing XLS files rather than using Excel. While there are plenty of tools for reading the BIFF8 XLS format, good tooling for manually creating and modifying XLS files doesn’t appear to be as common. As a result, I’ve created a tool for building and deobfuscating BIFF8 XLS Macro documents. This tool, Macrome, uses a modified version of the b2xtranslator library used by BiffView.

Macrome implements many of the obfuscations described in this blog post to help penetration testers more easily create documents for phishing campaigns. The modified b2xtranslator library can be used for research and experimentation with alternate obfuscation methods. Macrome also provides functionality that can be used to reverse many of these obfuscations in support of malware analysts and defenders. The tool was originally going to include functionality to process macros to help bypass obfuscated formulas, but @DissectMalware has already created a fantastic tool called XLMMacroDeobfuscator which goes above and beyond anything I was planning on dropping. It’s really a great piece of tech that I’d recommend to anyone who has to analyze these kinds of documents.

I’ll be posting in the future about how to further expand Macrome and implement your own obfuscation and deobfuscation methods. In the meantime, please give the tool a try at https://github.com/michaelweber/Macrome. If you have any suggestions or feature requests please let me know here or open an issue!