Malware Analysis 101

October 13, 2020

So, a bit of a preface: the following is a presentation that I did on the basics (and I do mean basics) of malware analysis. It's a little PowerPoint presentation along with my notes on the slides. I do not by any means profess to be any sort of expert on malware analysis, though I have done some before at a professional level. This is just to get you down the path and thinking about it. Please do not blow malware up on your corporate network and tell your boss the talking security cat said it was fine.

To cover this I'm going to step through 8 key areas to malware analysis, with memory forensics probably being the weakest area, mostly because I honestly have not run into any commercial grade malware (yet) that has required me to start actually sifting through memory.

If you don’t know why the OS works the way it does, then you will never actually understand what malware may or may not be doing. This covers everything from what processes you can expect to run on a vanilla Windows machine all the way to understanding that the Registry hives are basically just 5 different giant INI-style files (binary rather than plain text) lying around in System32\config. I strongly suggest understanding the basics of Linux and Unix systems and working your way into Windows.

The reason for this is that Windows at its core borrows a great deal from Unix, and while it may be different, the core and fundamental principles are still there. Directories, file system structures, accounts, logging, IO, and more all ultimately come back to Unix in some form or fashion, and thus if you understand how Unix handles these tasks you can get a better handle on how Windows has built upon them. It is for that reason that I suggest studying Linux to better understand computers in general. The fact is that every device from HMIs to servers and even phones is in some form or fashion a distant relative or cousin of those first operating systems.

On the Windows-specific front I suggest going through the SANS Hunt Evil poster, which covers a wide variety of topics when it comes to how and why Windows operates the way that it does: everything from .dll search orders (and how to hijack those) all the way to the differences between the Registry’s CurrentVersion.

Windows is a complex beast, and attempting to explain it all in a single slide is nigh impossible. So I will simply say that before you can even really get into malware analysis you should study how and why the operating system works the way that it does, so that when you do static or dynamic analysis of a piece of malware you know what it’s doing to the underlying operating system.

Coding at its core can be broken up into 3 realms

Scripted Code is code that you can just read. The system has some kind of on-the-fly interpreter, so any code that falls under this branch doesn’t require any sort of tool to reverse engineer. You can simply open the code in a text editor and read through it to understand what the code is doing

JavaScript (not to be confused with Java) is a language used by browsers (Firefox / IE / etc.) in order to give web pages life so to speak, so that when you click a button something happens on the page. It can be as simple as the code on a web page all the way to the OS-altering node.js / React / Angular. Keep in mind though that in the case of malicious JavaScript almost all of it will be browser based
PHP – While you can’t technically see this code if you are looking at the web page, if you have access to the server the .php files are as readable as English
Python – This needs Python to be installed on the OS (it comes with MacOS and MOST Linux distributions already)
PowerShell and Bash are both scripting languages for their relevant OS’s. They have a number of tools and functions designed to let them directly interact with the OS. PowerShell by and large should be considered just as, if not more, feature complete than C# or any compiled .NET language if constructed properly.

Compiled Code is code that, in order to get into its human readable form, is going to need to be “Decompiled”. Often times these are put through packers (special code intended to help shrink or optimize the code behind the scenes). Keep in mind that decompiled code will often lose its comments and other helpful things (like the names of a given variable) in the name of optimization.

C++ - One of the oldest (Current) compiled languages that is still floating around. Attempts to migrate away from this language have been ongoing as there are a number of issues inherent to the way that C++ handles memory; however, most large scale commercial products at some point have C++ (World of Warcraft, etc.). Depending on the compiler it can often produce binaries that run on nearly any system
C# / ASP.net – Microsoft’s take on the C++ coding language, Object oriented
Java – Developed in order to make an easy method for getting a language from one OS to another. Java is effectively a tiny virtual machine on your OS that allows it to run its code. It can run on Mac, Windows, and Linux so long as Java is installed and by and large will operate the same on each
Actionscript – This is the language behind Flash. Flash is important because for years it was the de facto standard for running applications from a web app. As such it has a number of ways to interact with the local OS, making it potentially dangerous if abused. For instance while JavaScript may not be able to get your IP or information about your processor, a piece of Flash could run and you’d never know it
GoLang – Developed by Google and referred to simply as “Go”, it represents a new family of programming languages (like Rust) where far more emphasis has been put into security from the onset. As such decompiling these can often be incredibly difficult

Assembly – As the horrible Warcraft monster suggests, Assembly is an entirely different beast altogether. At its core Assembly is the readable form of the hex core of a binary, the so called 1s and 0s that a processor actually understands. Anything can ultimately be taken back to assembly, and often times in static analysis when all else has failed this is where you will end up. Understanding assembly means understanding how and why a processor works the way it does. It means understanding the concept of “The Stack” and the functions for pushing and popping things on and off that stack. The great thing about assembly though is that no matter what, at the end of the day you can see exactly what a program is intended to do. Each architecture is slightly different (i.e. Assembly for Windows might be different than one for Mac, or even for different processors, as there are different symbol types involved). I am not knowledgeable enough on the subject matter to truly walk someone through assembly, but you can view it as a sort of last resort to programming.

w3Schools.com is a good primer in general for a variety of programming languages, and at the very least, most of the programming languages that you’ll run into in the wild. While there are some specifics to a given language (eval() in JavaScript), in general most of the principles that you learn can help you look at really any non-assembly language and get a feel for what it’s attempting to do. While some languages like PHP have some odd syntax to their commands, most functions are fairly straightforward across the board.

Binary / Hex – You need to be able to see 0x0F and realize that is 15, or look at 0100 and understand that is 4. You should probably also keep an ASCII table handy at the very least. Often times attackers will mask things in Binary or Hex in an attempt to obfuscate code, so being able to quickly recognize that can be important. Base64 and URL encoding are also common. Check CyberChef https://gchq.github.io/CyberChef/ to quickly convert things
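If you would rather check a conversion by hand than paste it into CyberChef, a few lines of Python will do the same job. This is only a minimal sketch; the sample strings below are made up for illustration and are not from any real sample.

```python
import base64
from urllib.parse import unquote

# Hex and binary are just other ways of writing the same integer.
hex_value = int("0F", 16)      # hexadecimal 0F
bin_value = int("0100", 2)     # binary 0100

# A run of hex bytes is often just ASCII text in disguise.
hex_text = bytes.fromhex("636d642e657865").decode("ascii")

# Base64 and URL encoding are equally common wrappers.
b64_text = base64.b64decode("d2hvYW1p").decode("ascii")
url_text = unquote("%2Fbin%2Fsh%20-c")

print(hex_value, bin_value)    # 15 4
print(hex_text)                # cmd.exe
print(b64_text)                # whoami
print(url_text)                # /bin/sh -c
```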

IF(TRUE)
    //Do Function
ELSE
    //Do Other

Basic programming logic is key here, along with understanding that functionally all programming languages operate on the same basic principles: if this, else that. Case statements work the same way, picking one branch out of several

CASE A
    //Do Function
    BREAK
CASE B
    //Do Other Function
    BREAK
DEFAULT
    //Do nothing
    BREAK

Loops are also critical to understanding programming

WHILE(X)
    //Do Y
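To see the same ideas in a real language, here is a rough Python sketch. The function names and port numbers are hypothetical, picked only to have something to branch on. Python only gained a dedicated case-style statement (match) in version 3.10, so the if / elif chain stands in for the CASE block here.

```python
def classify(port):
    # IF / ELSE IF / ELSE: only the first matching branch runs.
    if port == 80:
        return "http"
    elif port == 443:
        return "https"
    else:
        return "other"      # plays the role of DEFAULT


def countdown(start):
    # WHILE: repeat the body until the condition stops being true.
    seen = []
    while start > 0:
        seen.append(start)
        start -= 1
    return seen


print(classify(443))   # https
print(countdown(3))    # [3, 2, 1]
```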

In addition you will want to understand the concept of variables (just like algebra); however, unlike algebra a variable can be anything from a number to a sentence. Understanding the various “Common” variable types is key to understanding how they are used

Strings – Variables that contain text (ie “The quick brown fox jumps over the lazy dog”)

Int – Variables that are numbers (ie. 1 – 10)

Float – Variables that may contain numbers with decimal points (ie 1.5)

Boolean – True or False (1 or 0)

Each language will have its own series of types in general, and some languages don’t even require types. For instance Python, PHP, and JavaScript don’t require types to be assigned to variables. Instead the interpreters for each of those languages simply work to figure out what variable type was intended. Following that there are the more complex variable types like objects and arrays.
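A quick way to get a feel for this is to watch Python decide the type for you. The snippet below is just an illustrative sketch using the example values from the list above.

```python
# The same name can hold different types over time; Python works it out at run time.
value = "The quick brown fox"
kinds = [type(value).__name__]
value = 10
kinds.append(type(value).__name__)
value = 1.5
kinds.append(type(value).__name__)
value = True
kinds.append(type(value).__name__)

# The type decides what an operator does: text is joined, numbers are added.
joined = "1" + "1"
added = 1 + 1

print(kinds)           # ['str', 'int', 'float', 'bool']
print(joined, added)   # 11 2
```

That last pair is worth remembering, since obfuscated scripts lean heavily on string concatenation.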

Arrays can be fairly simple, think of them as a collection of variables. So for instance

ARRAY A = ["CAT","DOG","FISH"]

A[1] // Would output DOG

While things can get more complex, it is important to note that in MOST languages (read: all but some extreme edge cases like MATLAB) arrays start at 0
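The same array in Python (where it is called a list) looks like this. Treat it as a small sketch; the variable names are mine.

```python
animals = ["CAT", "DOG", "FISH"]

first = animals[0]     # indexes start at 0, so this is CAT
second = animals[1]    # DOG
last = animals[-1]     # Python shorthand for "the final item"
count = len(animals)

# Asking for an index past the end raises an error rather than returning junk.
try:
    animals[3]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, second, last, count, out_of_range)
```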

Objects on the other hand you can think of as a variable that contains variables and functions (it is an instance of a class) so

CLASS A
    FOO = “Bar”
    FUNCTION B
        //DO Function C

VAR B = NEW OBJECT A

B.FOO // Outputs Bar

B.B() //Does whatever Function C is expecting

While MOST of the malware I’ve run into doesn’t really get complex enough to worry about objects and what we call object oriented programming, it isn’t impossible that you might see it.
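For reference, this is roughly what a class and an object look like in Python. The Beacon class and the host name are entirely hypothetical and exist only to show a variable and a function living on the same object.

```python
class Beacon:
    """Hypothetical class, for illustration only."""

    def __init__(self, host):
        self.host = host        # a variable stored on the object
        self.attempts = 0

    def describe(self):         # a function attached to the object
        self.attempts += 1
        return f"{self.host} (attempt {self.attempts})"


b = Beacon("example.invalid")   # b is an instance of the class
print(b.host)
print(b.describe())
```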

Regular Expressions are common across virtually all languages as a way to describe a “Pattern” in text. For more information on them I strongly recommend https://regexone.com/
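As a small taste, here is a sketch of a pattern that pulls things shaped like IPv4 addresses out of a string. The pattern is deliberately loose (it would also accept 999.999.999.999) and the sample command line is invented, using documentation-range addresses.

```python
import re

# Four groups of 1-3 digits separated by dots, bounded by word edges.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

sample = "cmd /c ping 10.0.0.5 && curl http://203.0.113.7/payload"
found = IPV4.findall(sample)

print(found)   # ['10.0.0.5', '203.0.113.7']
```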

Functions and Classes are fairly straightforward. Think of them as reusable code (like a tiny little API call that you make all on your own).

Environment variables are things stored on the machine itself. So for instance in PowerShell $ENV:APPDATA will resolve to your current AppData folder
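Scripts in other languages resolve these the same way, which matters when a sample builds its drop path out of them. A minimal Python sketch follows; the DEMO_DIR variable in the usage note is a made-up name for illustration.

```python
import os


def expand(path_template):
    # os.path.expandvars resolves $NAME and ${NAME}, plus %NAME% when run on Windows.
    # Names that are not set are left in place untouched.
    return os.path.expandvars(path_template)


# APPDATA only exists on Windows, so fall back to a marker elsewhere.
appdata = os.environ.get("APPDATA", "<not set on this OS>")
print(appdata)
```

Setting os.environ["DEMO_DIR"] = "/tmp/demo" and then calling expand("$DEMO_DIR/evil.js") would give back /tmp/demo/evil.js.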

Finally compilers are how you take a compiled language (C++ / etc) and convert it into a runnable binary file that the Operating System can understand

File extensions are important, and so are the file signatures (“magic bytes”) that sit behind them. You can find a whole database of those at https://www.garykessler.net/library/file_sigs.html

Those signature headers help Windows and other operating systems (and your tools) figure out how to handle a given filetype. There is still some amount of trusting the extension (i.e. Windows may not render a .png if you rename it to a .jpg, etc.), but it is worth knowing what a given file was supposed to be. The most common file type that a given cybersecurity analyst probably wants to be aware of is the .exe. These are marked with 4D 5A (MZ). Why MZ you might ask? Well, funny story, it’s simply the initials of Mark Zbikowski, who was one of the leading developers of MS-DOS.
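Checking a signature yourself is simple enough. The sketch below hard-codes a handful of well-known signatures; the table is nowhere near complete and the function names are my own.

```python
# A few well-known signatures; real tables are far longer.
SIGNATURES = {
    b"MZ": "Windows executable (PE/DOS)",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"%PDF": "PDF document",
    b"PK\x03\x04": "ZIP archive (also .docx and .jar)",
}


def identify(header):
    # Compare the first bytes of the file against each known signature.
    for magic, label in SIGNATURES.items():
        if header.startswith(magic):
            return label
    return "unknown"


def identify_file(path):
    # Only the first few bytes are needed, whatever the extension claims.
    with open(path, "rb") as handle:
        return identify(handle.read(16))


print(identify(b"MZ\x90\x00"))   # Windows executable (PE/DOS)
```

Running identify_file against an "invoice.pdf" that comes back as a Windows executable is a pretty good hint something is off.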

Linux gets a shrug, because by and large it depends entirely on the kernel and distro you are running with. A .deb doesn’t actually mean anything to a non-Debian-based build, just the same as a .rpm doesn’t mean much to a Debian build. The only truly universal filetype that meanders about is .tar.gz (Tarball / Gzipped), which is modern day Linux’s version of a .zip. Beyond that most everything in the Linux world is a text file of some kind, and for any given binary you’re probably going to need a corresponding environment (Python / Java / etc.) to go with it.

But there’s a myriad of file types, and the reason for knowing them is more in line with knowing what the user might expect to see

.bmp / .jpg / .png / .gif / .svg – Images
.docx / .doc / .docm – Documents
.exe – Windows Executable
.app – Apple executable
.dll – “Dynamic Link Library” which is a fancy way of saying an exe that can’t run. Generally these are chock full of API calls
.ini – A text file that usually contains configuration information
.ps1 / .bat – Windows Powershell and CMD line shell files
.dmg – Mac Installer
.htm / .html – HTML file
.js – JavaScript file (be wary: unless a GPO tells it otherwise, the Windows scripting engine will just run these)
.php – PHP is a web scripting language
.py – Python, these files need to have some kind of python environment to run.
.plist – The Mac configuration file (think of it like the windows equivalent of a .ini )
.lnk – A shortcut pointer, your users don’t want to be bothered with remembering a filepath, so Windows provides the .lnk file type.
.class – 9 times out of 10 if you’re seeing this, there is probably some Java involved somewhere
.pdf – Adobe’s famous PDF format, be aware there are a TON of things this format is supposed to be capable of doing (hence the constant vulnerability discoveries)
.zip / .7z / .tar.gz – Compressed files
.eml – Emails

The real key here is having a virtual environment that can give you some kind of snapshotting feature. In addition it’s strongly advised that you consider a hypervisor that can handle making separate network segments (this is not possible by default with Virtual Box).

ESXi is free but requires the hardware to run on, as it is a bare metal hypervisor and the only manner for access would be through a tool like the ESXi web console. However, of the virtual offerings this is the most feature rich of the hypervisors, including VLAN creation, virtual switch creation, snapshots and more.
Hyper-V is included with any copy of Windows 10 Professional and above. It almost matches ESXi in terms of features, though Hyper-V does have a few known VM escape vulnerabilities that have been published, which raises questions about how secure it actually is. This is the technology that the Windows 10 Sandbox released with the most recent Windows 10 build is based on
VMware Workstation and VMware Fusion are effectively the same product; Fusion is simply the MacOS equivalent. They feature some limited network capabilities in terms of placing things on their own VLANs etc., enough so to build multiple networks. The catch however is that these do cost money.
Virtual Box has limited ability to do separate networks, so in honeynet testing it may not be as useful, but it has snapshots, which is really all that you need to have going for you (Or at least as of more recent builds it did)

Second is the victim OS. For dynamic analysis, one of the biggest things is making sure that your malware is actually going to run on your “Victim” OS

Windows 10 or Windows 7, use whichever happens to be more applicable to your environment. These days, Microsoft simply allows you to download the Windows 10 .iso directly from their website, so one might as well take advantage of that fact.

REMnux is a Linux toolkit aimed at static analysis of various malware components. You can get it from https://remnux.org/

Don’t forget to install the actual environments that regular users have on your victim OS. That means Flash, Java, Adobe Reader, Microsoft Office, etc. The big thing is to try to keep the patching in line with your own environment. Remember you want to see how this malware behaves against your machines by and large. If you are looking to simply analyze at a base level you may wish to make a snapshot of certain versions or patch levels

Finally the tools. All of these tools are aimed at the dynamic analysis side of things. Keep in mind that when you are doing dynamic analysis you are actively running the malware. This means the malware will absolutely detonate. As such you need to be absolutely confident at that point that your virtual environment won’t let it escape. In the VAST majority of cases (99.9%) the malware is probably going to be aimed at Windows. As such we’ll take a look at the Windows side of things

Procmon – Process Monitor, and part of the Windows “Sysinternals” suite. This will give you a flow of all events that take place on the OS. From file access to registry changes all the way to network traffic.
Process Explorer – Like Procmon but providing a live view. Think of this like a very complex and advanced form of task manager. One of the best features here is the ability to run strings against a process while it is in memory meaning you can see what it unpacks to
Regshot – Taking a snapshot of the registry before and after the execution of the malware to give you the ability to compare and contrast exactly what it was the malware has done
Wireshark – The ability to see and watch network traffic on the host giving you a valuable pcap and the ability to see calls to possible IOC’s
INetSim – Running on things like REMnux, this product can simulate responses to network requests, allowing you to see how the malware might behave if it got more than simply a failed response
Autoruns – This is an easy way to see all the things intended to run at startup, from scheduled tasks to the RunOnce registry keys. This is a great tool for trying to find an IOC of persistence
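The before-and-after idea behind Regshot is easy to picture in code. The sketch below is not Regshot's format or logic, just a plain dictionary comparison; the keys and values are hypothetical examples.

```python
def diff_snapshots(before, after):
    # Keys only in "after" were added, keys only in "before" were removed,
    # and keys in both with different values were changed.
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {k: (before[k], after[k])
               for k in before.keys() & after.keys()
               if before[k] != after[k]}
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshot data, purely for illustration.
RUN_KEY = r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run\Updater"
before = {r"HKCU\Software\Demo\Version": "1.0"}
after = {
    r"HKCU\Software\Demo\Version": "1.1",
    RUN_KEY: r"C:\Users\demo\AppData\Roaming\u.exe",
}

report = diff_snapshots(before, after)
print(report["added"])
print(report["changed"])
```

A new value under a Run key, as in the "added" bucket here, is exactly the kind of persistence IOC you are hoping the comparison surfaces.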

Make sure you have logging turned on for that host; this is beyond critical. I strongly encourage you to turn on PowerShell logging on your victim OS so you can simply review the event log to see exactly what piece of PowerShell was run.

Don’t be afraid to take a look at other tools as well; sometimes these samples have already been seen in the wild. Be sure to check the hash at VirusTotal and see what they might have on the malware. Often times they have their own sandboxes they run it through. In addition (at a cost) you can make use of something like any.run, which provides a massive toolkit to perform full dynamic analysis without having to set up your own lab.
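Getting the hash to search for takes only the standard library. This is a minimal sketch; the function name and chunk size are my own choices.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    # Read in chunks so a large sample never has to fit in memory at once.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Searching on the hash rather than uploading the file also means you are not handing a possibly sensitive sample to a third party.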

The book provided offers a hands-on guide to helping you build your very own virtual machine laboratory. The keys to the guide are about best practices as they relate to preventing VM escapes as well as setting up your own little virtual network. There are some suggestions in there to make use of a pfSense firewall. Keep in mind that the book assumes that you are going to be pentesting or trying to see how a pentest performs against a given machine, so the firewall may or may not fit depending on your needs. It can be a good way to segment off servers like Kolide.io’s Fleet, Splunk, and syslogging servers. In addition the book provides a number of tips and tricks for making sure that your virtual environment doesn’t actually look like a virtual environment to the software at play. A lot of this has to do with masking VMware tools or the MAC address OUIs used by VM appliances.

The key is pretty simple: you don’t want anything you’re doing in your virtual environment to escape out to the rest of your host. For this reason, if you can afford to do it, I suggest making your host OS different than your detonation / victim OS. That way, if during dynamic or static analysis the malware does manage to perform some kind of VM escape, the OS it lands on won’t be the same one it just came from. This can lessen your chance of problems escaping the virtual environment at the very least. In addition the book goes over how best to route your network in such a way that your virtual environment can’t use the same NIC as your host OS.

In my own lab, my host OS can make use of the wired connection and the VMware environment can only make use of the wireless NIC, with the option to use the other disabled both ways. As such the victim OS is never even aware of the host OS on the same network. For even more segmentation you could simply eliminate the connection to the virtual lab. For my own environment I have a shared folder on my host OS with the REMnux virtual OS. The REMnux virtual OS then serves as a gateway to the detonation OS and has no network connection of its own. In this way the victim OS is only aware of REMnux and the presented INetSim portion.

The key is to make this sandbox as isolated as possible, some of these malware samples could be incredibly dangerous and even their communications should at least be proxied or intercepted to ensure that they don’t necessarily reach out.

This is the middle ground that exists between Static and Dynamic analysis, as you can’t really analyze memory live at run time, but a memory image isn’t something you statically analyze in the usual sense either. Memory Forensics really only has one major tool at the moment (Volatility), but it can reconstruct the state of the entire OS at that very moment. This can give you insight into “NOP sleds” and similar techniques that attempt to take advantage of the way that memory and the stack work. In my experience at least, Memory Forensics tends to lean more into the DFIR portion of Cybersecurity than Malware Analysis.

If Dynamic analysis hasn’t provided you with the answer you are looking for, then it’s time to turn to the art of Reverse Engineering

This is an INCREDIBLY complex field and covers everything from changing eval() to console.log in Javascript all the way to dragging yourself through the Stack to find the key that was used in a piece of ransomware. Much like speaking about OS Internals, to cover this all in a single slide isn’t really possible. But to get into the basics of things, depending on what you need to do

Scripted Analysis

Notepad++ or Visual Studio Code is your friend. These can help provide syntax highlighting and allow you to simply use what you know of programming to mentally walk through a piece of malicious code.
Keep in mind F12 in Chrome, which will open up the developer console. For JavaScript specifically you can run the code live if you so desire, though I strongly suggest defanging it first by ensuring any eval() calls and the like are removed or swapped for console.log instead of being run
A lot of this is simply going to come to the skill of programming and applying your own knowledge of a given language against what you are reading
The general theory behind a scripted analysis is that whatever it is can be opened in notepad and simply be read

Compiled Analysis

This is where you’re going to want to get your decompilers like dotPeek or ILSpy, which can quickly decompile code that was written in C# or some other .NET family of languages. As a lot of Windows executables are written in .NET, they can often be returned to something very close to their original source code. Each tool has its own host of documentation on how best to accomplish this
Some code (C++, etc.) can often be difficult to decompile or unpack. In such a case you are going to want to consider a disassembler like IDA or Ghidra. This can take the code base back to its assembly and allow you to walk through everything step by step. Most of the time the programs are clever enough to figure out what the entry point should be

Debugging

Taking Compiled analysis a step further, debugging is a sort of hybrid of Static and Dynamic analysis. In debugging you are running the program but have inserted break points at specific points in the code. This allows you to manipulate values or flip a given flag so that you can step into a certain portion of code without having to meet a given condition.

Regardless of all this, often times you can use programs like ILSpy or the like to see what system calls or .dll files a program loads. This can help you better understand from the system calls the sorts of capabilities present. Read up on each of the API calls of a given .dll to figure out what a program might be capable of accessing. A program that doesn’t include any .dll’s with internet access probably isn’t going to be making calls to the rest of the network

Often times, malware authors will intentionally make things difficult, and as such you should be prepared to spend more than a fair amount of time simply unwrapping things. That runs from Base64 encoded command line entries to JavaScript that meanders through 3 miles of garbage code, with ridiculously long function and variable names and concatenation everywhere. CyberChef and code beautifiers will become your friends. There is a method to the madness, and the best suggestion is to treat it like a puzzle
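The Base64 command line case comes up constantly with PowerShell, whose -EncodedCommand parameter takes Base64 over UTF-16LE text rather than plain ASCII. That is why a naive decode shows a null byte between every letter. A minimal sketch follows; the helper names and the sample command are mine.

```python
import base64


def decode_encoded_command(blob):
    # PowerShell's -EncodedCommand value is Base64 over UTF-16LE text.
    return base64.b64decode(blob).decode("utf-16-le")


def encode_for_demo(command):
    # Only here so the example can build its own harmless test input.
    return base64.b64encode(command.encode("utf-16-le")).decode("ascii")


blob = encode_for_demo("Write-Output 'hello'")
decoded = decode_encoded_command(blob)

print(blob)
print(decoded)                              # Write-Output 'hello'
print(decode_encoded_command("ZABpAHIA"))   # dir
```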

Scripted Anti Measures

Change complex variables names back into something readable
Deobfuscate where you can, base64 decoding or reversing ToChar methods
If the code executes in one spot, simply add a console.log or similar method so that you can get an idea of what it would have executed in a logged form instead
Look for keys (URLs, IPs, etc.)
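One common flavor of the character-code trick is JavaScript built out of String.fromCharCode(...) calls. The sketch below folds those calls back into readable string literals without ever running the script. It assumes the arguments are plain decimal numbers and does not escape quotes in the result, so treat it as a starting point only; the obfuscated line is invented.

```python
import re

# Matches String.fromCharCode(...) where the arguments are only digits, commas and spaces.
CALL = re.compile(r"String\.fromCharCode\(([\d\s,]+)\)")


def fold_char_codes(source):
    def to_literal(match):
        codes = [int(c) for c in match.group(1).split(",") if c.strip()]
        return '"' + "".join(chr(c) for c in codes) + '"'
    return CALL.sub(to_literal, source)


obfuscated = "var u = String.fromCharCode(104,116,116,112) + String.fromCharCode(58, 47, 47);"
readable = fold_char_codes(obfuscated)

print(readable)   # var u = "http" + "://";
```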

Compiled Anti Measures

Packing can be countered with unpacking tools like PE.Explorer IF you know what was used to pack the executable in the first place
Debugging can be used to counteract certain flags being required

Virtualization Anti Measures

Some programs (like WannaCry) make certain DNS calls, and if the DNS entry actually resolves they stop execution. On the flip side, sometimes you have to let samples talk to the real DNS servers and internet for them to execute
Some malware will check for common VMware OUIs on interfaces or VMware Tools as a running process
Complex malware will even search for common VMware drivers being present on the device.
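The OUI check in particular is trivial for malware to perform, which is why masking the MAC address is worth the effort. Here is a rough sketch of what such a check amounts to, using the OUI prefixes commonly associated with VMware; the function names are my own and real samples will do this in whatever language they are written in.

```python
# OUI prefixes commonly associated with VMware virtual NICs.
VMWARE_OUIS = {"00:05:69", "00:0C:29", "00:1C:14", "00:50:56"}


def normalize(mac):
    # Accept 00-0c-29-..., 000c.29ab.cdef and similar, and emit 00:0C:29:...
    digits = "".join(ch for ch in mac if ch.isalnum()).upper()
    return ":".join(digits[i:i + 2] for i in range(0, len(digits), 2))


def looks_like_vmware(mac):
    # The first three octets (8 characters once normalized) are the vendor OUI.
    return normalize(mac)[:8] in VMWARE_OUIS


print(looks_like_vmware("00-0c-29-ab-cd-ef"))   # True
print(looks_like_vmware("3C:5A:B4:01:02:03"))   # False
```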

Dynamic

Programs may look for certain mutexes (named locks that programs use to coordinate with one another) or other executables
Some programs may look for things like ProcMon (Either by name or Hash) or Wireshark

All of these measures each require a different and unique response to overcome BUT THEY ARE NOT COMMON. The most common thing you will see is obfuscated code, but most commercial grade malware simply wants to exploit the system and run without concern for malware analysts looking at it. Anti-analysis measures tend to show up with your more advanced APTs or large crime groups

There are a WIDE variety of IOC’s and thousands of tools to actually make decisions based off of them.

YARA is one of the most commonly known methods, allowing you to target things down to the hex level of a given executable to make decisions on whether a file needs to be detected / removed. Common Indicators of Compromise include IP addresses of C2 servers, URLs associated with those, domains, etc. This is what you can feed your IR guys to make sure they know what they are looking for.
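To give a feel for what matching "to the hex level" means, here is a small Python sketch that searches bytes for a hex pattern where ?? stands for any single byte. This is not YARA and does not use YARA's engine or rule syntax; it only imitates the idea of a hex string with wildcards, and the function names are my own.

```python
import re


def hex_pattern_to_regex(pattern):
    # Turn a pattern such as "4D 5A ?? 00" into a compiled bytes regex.
    parts = []
    for token in pattern.split():
        if token == "??":
            parts.append(b".")                           # wildcard: any one byte
        else:
            parts.append(re.escape(bytes.fromhex(token)))
    # DOTALL so the wildcard also matches a newline byte.
    return re.compile(b"".join(parts), re.DOTALL)


def matches(pattern, data):
    return hex_pattern_to_regex(pattern).search(data) is not None


print(matches("4D 5A ?? 00", b"MZ\x90\x00\x03"))   # True
print(matches("4D 5A ?? 00", b"MZ\x90\x01"))       # False
```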