Introduction to Reverse Engineering with Python
Python is a high-level language. Many of you may think of it as a scripting language rather than a programming language, but it is both: a general-purpose programming language that is widely used for scripting. It sits nowhere near machine code or machine language. So what is it that makes Python so interesting? If you have any experience in pentesting, or have talked with web security experts or malware analysts, you will have heard many of them suggest Python as their primary language to develop malware or exploits.
While some people may prefer C, C++ or Perl, I would personally vote for Python. The reason is that it is useful not only as a tool for writing a program, but also for breaking one.
What is Reverse Engineering?
Reverse engineering is a very broad concept, and it is hard to pin down in one simple definition. The core idea is to break a piece of code down into simpler parts, understand it, modify and enhance it for our own purpose, and then reassemble it to fit our needs. To make it a bit simpler, let me give you an extremely common example.
Let us take an Android cell phone. Manufacturers create a stock ROM and sell it to their consumers. But most of the time it contains a lot of bloatware, and it becomes laggy. So there are people on websites like XDA and androidcentral who reverse engineer their way into the ROM, enhance it and make it lag-free. One practical example would be the CyanogenMod ROM.
But this was just an example to help you understand what it is. Reverse engineering of software follows the same concept, but it is far more complicated than just modifying a ROM.
Compilation and Python
If you have experience in Python, then you know that a Python script, be it a virus, payload, trojan or anything else, will only work on computers on which Python is installed. So, let us say I have written an excellent trojan that can bypass any antivirus, and I start to deploy it on a Windows system. If that system does not have the Python interpreter installed, the script won't work. So, one needs to compile the script, or more precisely package every file of it together with an interpreter, into an executable and then deploy that on the Windows system.
Reverse Engineering Windows Executables
Now that you know we need to compile Python scripts to run on Windows, you may be wondering whether there is a tool that converts Python scripts into an executable. Yes, there is. Its name is py2exe. py2exe converts Python scripts into standalone Windows programs. There is also another tool that converts Windows executables written in Python back into Python scripts. Its name is PyInstaller Exe Rebuilder.
PyInstaller Exe Rebuilder is a tool to recompile/reverse engineer PyInstaller-generated executables without having access to the source code. When you launch the EXE, it is unpacked in memory. This includes the .pyc files (Python code that has been converted to bytecode). Basically, what tools like PyInstaller and py2exe do is package the libraries and dependencies all together so you can run the ‘stand-alone’ EXE without having to download them or prepare the machine with a Python interpreter.
There is also another toolkit that takes you very near to the source code. Its name is pyREtic, a toolkit to reverse engineer obfuscated Python bytecode. It allows you to take an object in memory back to source code without directly accessing the bytecode on disk. This can be useful if the application's .pyc files on disk are obfuscated in one of many ways.
Reverse Engineering The Hard Way
The above part is easy to understand and to do in practice when you have at least a basic knowledge of Python. But that's not always the case. Sometimes you don't have any documentation or comments in the Python script, and there are too many files for you to understand all by yourself. There is an awesome book on this part, but I won't be concentrating much on it.
The name of the book is “Working Effectively with Legacy Code”. The book is independent of Python or any other language and will give you an idea of how to reverse engineer code in almost any language. When trying to understand a piece of code, the key focus is the reason you want to understand it.
The approach differs quite a bit depending on whether you want to reverse engineer the code to modify it or to port it. Either way, instrumenting the legacy code with batteries of tests, scaffolding, and tracing/logging is the crucial path on the long, hard slog to understanding and modifying it safely and responsibly.
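To make the tracing/logging idea concrete, here is a minimal sketch using only the standard library. The decorator name traced and the function legacy_discount are hypothetical stand-ins for whatever undocumented code you are instrumenting; this is one possible approach, not the method from the book.

```python
import functools
import logging

logger = logging.getLogger("legacy_trace")


def traced(func):
    """Log every call to func together with its arguments and result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.info("return %s -> %r", func.__name__, result)
        return result
    return wrapper


@traced
def legacy_discount(price, rate):
    # Hypothetical stand-in for an undocumented legacy function.
    return price - price * rate


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    legacy_discount(100, 0.25)
```

Because functools.wraps preserves the function name, the decorator can be added and removed without disturbing code that introspects the original function.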
Reverse Engineering Tools
There is another method that makes things a bit easier, and you can use it alongside the steps above. There is a tool called Epydoc. It inspects the code and generates some documentation for it. The result will not be as good as the original documentation, but it will at least give you an idea of how the code works. With this as a starting point, you can begin writing your own documentation, and after partially writing it, you can run the tool again to generate documentation for the remaining part.
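If you just want a quick feel for what generated documentation looks like, the idea can be approximated with the standard inspect module. This is not Epydoc and makes no claim about how Epydoc works internally; the function outline and the class Sample are hypothetical names for illustration.

```python
import inspect


def outline(obj):
    """Return 'name(signature): first docstring line' for each public callable."""
    lines = []
    for name, member in inspect.getmembers(obj, callable):
        if name.startswith("_"):
            continue  # skip private and dunder members
        try:
            signature = str(inspect.signature(member))
        except (TypeError, ValueError):
            signature = "(...)"  # some builtins expose no signature
        doc = inspect.getdoc(member) or "UNDOCUMENTED"
        lines.append(f"{name}{signature}: {doc.splitlines()[0]}")
    return lines


class Sample:
    # Hypothetical class standing in for undocumented legacy code.
    def documented(self, x):
        """Return x doubled."""
        return 2 * x

    def bare(self, y):
        return y


if __name__ == "__main__":
    for line in outline(Sample):
        print(line)
```

Members flagged as UNDOCUMENTED are a reasonable to-do list for the documentation you write yourself.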
You can even use an IDE to analyse the code. This typically gives you code completion, but more importantly, in this case, it makes it possible to just ctrl-click on a variable to see where it comes from. This really speeds things up when you want to understand other people's code.
Also, you need to learn a debugger. In tricky parts of the code, you will have to step through them in a debugger to see what the code actually does. Python's pdb works, but many IDEs have integrated debuggers, which make debugging easier. PyReverse from Logilab and PyNSource from Andy Bulka are helpful too, for UML diagram generation.
There is a process for producing a UML class model from a given input of source code. With it, you can reverse a snapshot of your code base into UML classes and then form a class diagram from them. Bringing code content into a visual UML model helps programmers or software engineers review an implementation, identify potential bugs or deficiencies, and look for possible improvements.
Apart from this, developers may reverse a code library into UML classes and construct a model with them, for example to reverse a generic collection framework and develop their own framework by extending the generic one. In this chapter, we will go through the instant reverse of Python.
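The first step of such a reversal, recovering classes, base classes and methods from source, can be sketched with the standard ast module. This is only a rough illustration of the idea, not how PyReverse or PyNSource are implemented; class_model and SAMPLE are hypothetical names, and ast.unparse assumes Python 3.9 or newer.

```python
import ast


def class_model(source):
    """Map each class name to its base classes and method names."""
    model = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            bases = [ast.unparse(base) for base in node.bases]
            methods = [
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
            model[node.name] = {"bases": bases, "methods": methods}
    return model


# Hypothetical source snapshot to reverse.
SAMPLE = '''
class Shape:
    def area(self): ...

class Circle(Shape):
    def __init__(self, r): self.r = r
    def area(self): return 3.14159 * self.r ** 2
'''

if __name__ == "__main__":
    for name, info in class_model(SAMPLE).items():
        print(name, info)
```

The resulting dictionary holds what a class diagram needs for its boxes (classes and methods) and its inheritance arrows (bases).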
Objects and Primers
To fully understand Python’s inner workings, one should first become familiar with how Python compiles and executes code. When code is compiled in Python, the result is a code object. A code object is immutable and contains all of the information the interpreter needs to run the code. A byte code instruction is represented as a one-byte opcode value followed by arguments when required (since CPython 3.6, every instruction occupies two bytes: the opcode and one argument byte). Data is referenced using an index into one of the other properties of the code object, such as its tables of constants and names.
A byte code string looks like this:
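As a hedged illustration, the sketch below compiles a small hypothetical function greet and prints its raw byte code string (co_code) as hex, along with the tables that the instructions index into. The exact bytes differ between Python versions, so no output is shown here.

```python
import dis


def greet(name):
    # Hypothetical function used only to produce some byte code.
    return "hello " + name.upper()


code = greet.__code__
raw = code.co_code  # the byte code string itself, as a bytes object

if __name__ == "__main__":
    print(raw.hex())          # raw opcodes and arguments
    print(code.co_consts)     # constants referenced by index
    print(code.co_names)      # global and attribute names referenced by index
    print(code.co_varnames)   # local variable names referenced by index
    dis.dis(greet)            # human-readable disassembly of the same bytes
```

Comparing the hex dump with the dis output side by side is a quick way to see how each opcode's argument points into co_consts, co_names or co_varnames.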
Python byte code operates on a stack of items. A more enterprising extension would be to attempt to decompile the byte code back into readable Python source code, complete with object and function names. Python code can be distributed in binary form by utilizing the marshal module. This module provides the ability to serialize and deserialize code objects using its dump and load functions (and their in-memory counterparts, dumps and loads).
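A minimal sketch of that round trip, assuming a one-line hypothetical source string: the code object is serialized to bytes with marshal.dumps, restored with marshal.loads, and executed. Keep in mind that the marshal format is specific to the Python version that wrote it.

```python
import marshal

source = "result = 6 * 7"  # hypothetical source to distribute in binary form
code = compile(source, "<example>", "exec")

blob = marshal.dumps(code)      # serialize the code object to bytes
restored = marshal.loads(blob)  # deserialize it back into a code object

namespace = {}
exec(restored, namespace)

if __name__ == "__main__":
    print(namespace["result"])  # 42
```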
The most commonly encountered binary format is a compiled Python file (.pyc), which contains a magic number, a timestamp, and a serialized code object (newer Python versions add further header fields). The Python interpreter usually produces this file type as a cache of the compiled object to avoid having to parse the source multiple times. Reverse engineering techniques rely on this ease of access to byte code and type information.
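Here is a rough sketch of reading such a file by hand. It assumes Python 3.7 or newer, where the header is 16 bytes (magic number, flags, then either timestamp and source size or a source hash); older versions use a shorter header. The helper load_pyc and the file demo.py are hypothetical.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile


def load_pyc(path):
    """Return (magic, code_object) from a .pyc written by this interpreter."""
    with open(path, "rb") as f:
        magic = f.read(4)
        f.read(12)  # Python 3.7+: 4 bytes of flags plus 8 bytes of mtime/size or hash
        code = marshal.load(f)
    return magic, code


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    with open(src, "w") as f:
        f.write("ANSWER = 42\n")
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, "demo.pyc"))
    magic, code = load_pyc(pyc)

if __name__ == "__main__":
    print(magic == importlib.util.MAGIC_NUMBER)
    print(code.co_names)
```

Once you hold the code object, everything shown earlier for in-memory code objects (disassembly, constant tables, names) applies to it as well.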
With a code object’s byte code, code logic can be modified or even replaced entirely. Extracting type information can aid in program design comprehension and identification of function and object purposes.
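As a small, hedged example of modifying code logic through the code object, the sketch below swaps one constant in a hypothetical function banner. It assumes Python 3.8 or newer, where code objects provide a replace method that returns a modified copy (the object itself stays immutable).

```python
def banner():
    # Hypothetical function whose behaviour we want to change without its source.
    return "licensed to: nobody"


original = banner.__code__

# Build a new constants tuple with the target string swapped out.
patched_consts = tuple(
    "licensed to: me" if const == "licensed to: nobody" else const
    for const in original.co_consts
)

# Code objects are immutable, so create a copy and rebind it to the function.
banner.__code__ = original.replace(co_consts=patched_consts)

if __name__ == "__main__":
    print(banner())  # licensed to: me
```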
The obfuscation and hardening of application byte code will always be a race between the implementers and those seeking to break it. To attempt to defend against byte code retrieval, the logical first step is towards a runtime translation solution.
Properties of a code object could be stored in any signed, encrypted, or otherwise obfuscated format that is de-obfuscated or translated during runtime and used to instantiate a new object. One could even change the way variable name lookups work within the interpreter to obfuscate naming information. A developer could further mitigate reversing attempts by adding a translation layer between the lookup of the actual names and the names within the source code.
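To show the shape of a runtime translation layer, here is a deliberately weak sketch: the marshalled code object is stored XOR-ed with a single byte and only translated back just before execution. The key KEY and the helpers xor_bytes, pack and run are hypothetical, and a single-byte XOR offers no real protection; it only illustrates where the de-obfuscation step sits.

```python
import marshal

KEY = 0x5A  # hypothetical single-byte key, trivially weak, for illustration only


def xor_bytes(data, key=KEY):
    """XOR every byte with the key; applying it twice restores the input."""
    return bytes(byte ^ key for byte in data)


def pack(source):
    """Compile source and return its obfuscated, marshalled form."""
    code = compile(source, "<hidden>", "exec")
    return xor_bytes(marshal.dumps(code))


def run(blob):
    """De-obfuscate at runtime, rebuild the code object, and execute it."""
    code = marshal.loads(xor_bytes(blob))
    namespace = {}
    exec(code, namespace)
    return namespace


blob = pack("def secret():\n    return 'unlocked'\n")
namespace = run(blob)

if __name__ == "__main__":
    print(namespace["secret"]())  # unlocked
```

Note that the code object still exists in plain form in memory after run, which is exactly the weakness that in-memory tools such as the toolkit mentioned earlier exploit.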
Now, after reading all this, you may feel the need to go and experiment with a few of the tools out there. So, here are some tools that can help you reverse engineer your way into your Python code:
- The Carrera Collection
All of these are great pieces of code, but what really makes them outstanding is using them together. Keep in mind this is by no means a complete list, just the ones that I use the most and that I think show how the flexibility of Python can make a task as complex as reverse engineering manageable.