Introduction to Reverse Engineering with Python
Python is a high-level language. Many people call it a scripting language rather than a programming language, but that distinction is misleading: Python is a general-purpose programming language that happens to be interpreted, and it sits far from machine code. So what makes Python so interesting? If you have any experience in pentesting, or have talked with web security experts or malware analysts, you will have noticed that many of them recommend Python as their primary language for developing malware or exploits.
While some people may prefer C, C++ or Perl, I would personally vote for Python. The reason is that it is useful not only for writing a program but also for breaking one.
What is Reverse Engineering?
Reverse engineering is a very broad concept, and it is hard to define in simple terms. The core idea is to break a piece of code down into simpler parts, understand it, modify and enhance it for our own purposes, and then reassemble it to fit our needs. To make it a bit simpler, let me give you a very common example.
Let's take the example of an Android phone. Manufacturers create a stock ROM and sell it to their consumers. Most of the time it contains a lot of bloatware and becomes laggy. So there are people on sites like XDA and Android Central who reverse engineer their way into the ROM, enhance it and remove the lag. One practical example is the CyanogenMod ROM.
But this was just an example to help you understand the idea. Reverse engineering software follows the same concept, but it is far more complicated than just modifying a ROM.
Compilation and Python
If you have experience with Python, then you know that a Python script, be it a virus, payload, trojan or anything else, only works on computers where Python is installed. So, let's say I have written an excellent trojan that can bypass any antivirus and I start to deploy it on a Windows system. If that system does not have the Python interpreter installed, it won't work. So one needs to package the script, together with every file it depends on, into an executable and then deploy that on the Windows system.
Reverse Engineering Windows Executables
Now that you know Python scripts need to be packaged to run on Windows, you may have guessed that there must be a tool that converts Python scripts into an executable. Yes, there is. Its name is py2exe. py2exe is a simple tool that converts Python scripts into standalone Windows programs; PyInstaller is another tool that does the same job. There is also a tool that goes the other way and turns Windows executables built from Python back into Python scripts. Its name is PyInstaller Exe Rebuilder.
PyInstaller Exe Rebuilder is a tool to recompile or reverse engineer PyInstaller-generated executables without having access to the source code. When you launch the EXE, it is unpacked in memory. This includes the .pyc files (Python code that has been compiled to bytecode). Basically, what tools like PyInstaller and py2exe do is package the libraries and dependencies together so that you can run the 'stand-alone' EXE without having to download them or prepare the machine with a Python interpreter.
There is another toolkit that gets you very close to the source code. Its name is pyREtic, a toolkit for reverse engineering obfuscated Python bytecode. It allows you to take an object in memory back to source code, without needing access to the bytecode on disk. This can be useful if the application's .pyc files on disk are obfuscated in one of many ways.
Reverse Engineering The Hard Way
The part above is easy to understand and to do in practice when you have at least basic knowledge of Python. But that's not always the case. Sometimes there is no documentation and there are no comments in the Python script, and there are too many files for you to understand all by yourself. There is an awesome book on this topic, but I won't concentrate much on it here.
The name of the book is “Working Effectively with Legacy Code”. The book is independent of Python or any other language, and it will give you an idea of how to approach reverse engineering in almost any language. The key question, when trying to understand a piece of code, is why you want to understand it.
Whether you want to reverse engineer the code to modify it or to port it, the approach would be quite different. Either way, instrumenting the legacy code with a scaffolding of tests and tracing/logging is the crucial path on the long, hard slog to understanding and modifying it safely and responsibly.
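To make the idea of instrumenting legacy code concrete, here is a minimal sketch. The function `legacy_discount` is a hypothetical stand-in for an undocumented function, and the `traced` decorator and the test name are assumptions of this example, not something taken from the book.

```python
import functools
import logging

logger = logging.getLogger("legacy_trace")


def traced(func):
    """Log every call to func together with its arguments and its result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.debug("return %s -> %r", func.__name__, result)
        return result
    return wrapper


# Hypothetical stand-in for an undocumented legacy function.
@traced
def legacy_discount(price, customer_type):
    if customer_type == "vip":
        return price / 2
    return price


def test_legacy_discount_current_behaviour():
    """Characterization test: record what the code does today, before changing it."""
    assert legacy_discount(100, "vip") == 50.0
    assert legacy_discount(100, "regular") == 100


logging.basicConfig(level=logging.DEBUG)
test_legacy_discount_current_behaviour()
```

The tracing shows how the code is really called, and the test pins down its current behaviour so that later modifications can be checked against it.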
Reverse Engineering Tools
There is another method that makes this a bit easier, which you can follow alongside the steps above. There is a tool called Epydoc. It inspects the code and generates documentation for it. The result will not be as good as hand-written documentation, but it will at least give you an idea of how the code works. With that as a starting point, you can begin writing your own documentation, and after documenting part of the code you can run the tool again to regenerate documentation for the remaining part.
You can also use an IDE to analyse the code. This typically gives you code completion but, more importantly in this case, it makes it possible to just ctrl-click on a variable to see where it comes from. This really speeds things up when you want to understand other people's code.
You also need to learn a debugger. In tricky parts of the code you will have to step through them in a debugger to see what the code actually does. Python's pdb works, but many IDEs have integrated debuggers, which make debugging easier. Pyreverse from Logilab and PyNSource from Andy Bulka are helpful too, for UML diagram generation.
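The kind of outline a documentation generator produces can be approximated with the standard library. The sketch below is not how Epydoc itself works; it only uses `inspect` to list the public functions of a module with their signatures and the first line of their docstrings. The helper name `outline` is an assumption, and `json` merely stands in for the module you are studying.

```python
import inspect
import json  # stand-in for the undocumented module under study


def outline(module):
    """Map each public function in module to (signature, first docstring line)."""
    summary = {}
    for name, func in inspect.getmembers(module, inspect.isfunction):
        if name.startswith("_"):
            continue
        try:
            signature = str(inspect.signature(func))
        except (TypeError, ValueError):
            signature = "(...)"  # signature could not be determined
        doc = inspect.getdoc(func) or "(no docstring)"
        summary[name] = (signature, doc.splitlines()[0])
    return summary


for name, (signature, first_line) in sorted(outline(json).items()):
    print(f"{name}{signature}\n    {first_line}")
```

Functions that show up with "(no docstring)" are a natural place to start writing your own documentation.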
These tools produce a UML class model from source code given as input. With this, you can reverse a snapshot of your code base into UML classes and then build a class diagram from them. Bringing the code into a visual UML model helps programmers and software engineers review an implementation, identify potential bugs or deficiencies and look for possible improvements.
Apart from this, developers may reverse a code library into UML classes and construct a model with them, for example by reversing a generic collection framework and developing their own framework that extends the generic one. In this chapter, we will go through the instant reverse of Python.
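As a rough illustration of what reversing source code into a class model involves, the sketch below uses the standard `ast` module to collect each class with its base classes and methods. It is not how Pyreverse or PyNSource are implemented; `class_model`, `_base_name` and the `SAMPLE` source are names invented for this example.

```python
import ast

# Hypothetical source code to reverse into a class model.
SAMPLE = '''
class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3.14159 * self.r ** 2
'''


def _base_name(node):
    """Render a base-class expression such as Shape or pkg.Shape as text."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return _base_name(node.value) + "." + node.attr
    return "<expression>"


def class_model(source):
    """Return {class name: {"bases": [...], "methods": [...]}} for the source."""
    model = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            model[node.name] = {
                "bases": [_base_name(base) for base in node.bases],
                "methods": [
                    item.name for item in node.body
                    if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
                ],
            }
    return model


for name, info in class_model(SAMPLE).items():
    print(name, "inherits from", info["bases"], "and defines", info["methods"])
```

Because the source is only parsed and never executed, this approach is safe to run on code you do not trust yet.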
Objects and Primers
To fully understand the inner workings of Python, one should first become familiar with how Python compiles and executes code. When code is compiled in Python the result is a code object. A code object is immutable and contains all of the information needed by the interpreter to run the code. A byte code instruction is represented as a one byte opcode value followed by arguments when required. Data is referenced using an index into one of the other properties of the code object.
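A code object can be inspected directly from Python. The snippet below is a small sketch using the built-in `compile` function; the source string and the variable names in it are made up for the example.

```python
code = compile("total = price * quantity", "<example>", "exec")

print(type(code).__name__)   # the compiled result is a code object
print(code.co_names)         # names the byte code refers to by index
print(code.co_consts)        # constants the byte code refers to by index
print(code.co_code.hex())    # the raw byte code string, shown as hex

# Code objects are read-only: attribute assignment is rejected.
try:
    code.co_name = "renamed"
    is_immutable = False
except (AttributeError, TypeError):
    is_immutable = True
print(is_immutable)

# The code object carries everything the interpreter needs to run it.
namespace = {"price": 3, "quantity": 4}
exec(code, namespace)
print(namespace["total"])
```

The exact bytes in `co_code` differ between Python versions, so the output of the hex dump should not be relied on across interpreters.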
A byte code string looks like this:
\x64\x02\x64\x08\x66\x02
Python byte code operates on a stack of items. A more enterprising extension would be to attempt to decompile the byte code back into readable Python source code, complete with object and function names. Python code can be distributed in binary form by utilizing the marshal module. This module provides the ability to serialize and deserialize code objects using its dump and load functions.
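Both ideas can be tried with the standard library. The sketch below disassembles a small function with `dis` to show the stack-oriented instructions, then serializes its code object with `marshal` and rebuilds a callable from the result. The function `add` is just an example.

```python
import dis
import marshal
import types


def add(a, b):
    return a + b


# Each instruction pushes values onto, or pops values from, the evaluation stack.
for instruction in dis.get_instructions(add):
    print(instruction.opname, instruction.argrepr)

# Serialize the code object to bytes and load it back.
blob = marshal.dumps(add.__code__)
restored = marshal.loads(blob)

# A code object becomes callable again once it is wrapped in a function.
rebuilt = types.FunctionType(restored, {})
print(rebuilt(2, 3))  # 5
```

The marshal format is specific to the Python version that produced it, so a blob should be loaded with the same interpreter version that dumped it.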
The most commonly encountered binary format is a compiled Python file (.pyc) which contains a magic number, a timestamp, and a serialized object. This file type is usually produced by the Python interpreter as a cache of the compiled object to avoid having to parse the source multiple times. These techniques rely on the ease of access to byte code and type information.
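The layout of a .pyc file can be explored with a few lines of code. The sketch below assumes the header used by CPython 3.7 and later, which is 16 bytes long: the magic number, a flags field, and then the source timestamp and source size for timestamp-based files. Older versions use a shorter header, so the offsets would need adjusting there. The helper `read_pyc` and the file `hello.py` are invented for the example.

```python
import importlib.util
import marshal
import os
import py_compile
import struct
import tempfile


def read_pyc(path):
    """Split a .pyc file into its header fields and its code object.

    Assumes the CPython 3.7+ layout: 4-byte magic number, 4-byte flags,
    4-byte source mtime and 4-byte source size, followed by marshal data.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    magic = data[:4]
    flags, mtime, size = struct.unpack("<III", data[4:16])
    code = marshal.loads(data[16:])
    return magic, flags, mtime, size, code


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.py")
    with open(src, "w") as handle:
        handle.write("GREETING = 'hi'\n")
    source_size = os.path.getsize(src)
    pyc = py_compile.compile(
        src,
        cfile=os.path.join(tmp, "hello.pyc"),
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    magic, flags, mtime, size, code = read_pyc(pyc)

print(magic == importlib.util.MAGIC_NUMBER)  # magic matches this interpreter
print(flags, mtime, size)
print(code.co_names)
```

Checking the magic number first is worthwhile in practice, because it tells you which Python version wrote the file and therefore which interpreter can unmarshal it.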
With a code object’s byte code, code logic can be modified or even replaced entirely. Extracting type information can aid in program design comprehension and identification of function and object purposes.
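Here is a minimal sketch of modifying code logic through the code object. It assumes Python 3.8 or later, where code objects have a `replace` method, and the function `check_license` is a hypothetical target. The code object itself is not changed; a modified copy is built and swapped in.

```python
def check_license():
    return "denied"


original = check_license.__code__

# Build a new constants tuple with the interesting value swapped out.
patched_consts = tuple(
    "granted" if const == "denied" else const
    for const in original.co_consts
)

# Create a modified copy of the code object and attach it to the function.
check_license.__code__ = original.replace(co_consts=patched_consts)

print(check_license())  # granted
```

The byte code is untouched here; only the data it indexes into has changed, which is often enough to alter what a function does.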
The obfuscation and hardening of application byte code will always be a race between the implementers and those seeking to break it. To attempt to defend against byte code retrieval, the logical first step is towards a runtime translation solution.
Properties of a code object could be stored in any signed, encrypted, or otherwise obfuscated format that is de-obfuscated or translated during runtime and used to instantiate a new object. One could even change the way variable name lookups work within the interpreter to obfuscate naming information. By adding a translation layer between the lookup of the actual names and the names within the source code, a developer could further mitigate reversing attempts.
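A toy version of such a runtime translation layer might look like the sketch below. The single-byte XOR key and the helper names `pack` and `run` are assumptions of this example, and XOR is used only to keep it short; it offers no real protection.

```python
import marshal

KEY = 0x5A  # hypothetical single-byte key, for illustration only


def pack(source):
    """Compile source and return its marshalled code object, XOR-obfuscated."""
    code = compile(source, "<protected>", "exec")
    return bytes(byte ^ KEY for byte in marshal.dumps(code))


def run(blob):
    """De-obfuscate the blob at runtime, rebuild the code object and execute it."""
    code = marshal.loads(bytes(byte ^ KEY for byte in blob))
    namespace = {}
    exec(code, namespace)
    return namespace


blob = pack("def secret():\n    return 42\n")
namespace = run(blob)
print(namespace["secret"]())  # 42
```

Note that the code object still exists in plain form in memory once `run` has finished, which is exactly the point at which in-memory reversing tools can pick it up.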
Conclusion
Now, after reading all this, you may feel the need to go and experiment with a few of the tools out there. So, here are some tools which can help you reverse engineer your way into your Python code:
- Paimei
- Sulley
- The Carrera Collection
- PyEmu
- IDAPython
- ImmDbg
All of these are great pieces of code, but what really makes them outstanding is using them together. Keep in mind this is by no means a complete list, just the ones that I use the most and that I think show how the flexibility of Python can make a task as complex as reverse engineering manageable.