decompy.DataGathering package

Submodules

decompy.DataGathering.ClangSubprocess module

class decompy.DataGathering.ClangSubprocess.Clang[source]

Bases: object

Class to define functions for calling the Clang compiler

static compile_all(file_path, newlocation, out_type, args='')[source]

Compiles the C files listed in the given file path with Clang, using the specified args. Writes each result by calling compile_cfile, then returns the specified location of the file path. If this is being used to filter the input files, the C files that successfully compile will be recorded in the filter file.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
  • out_type – the type that the file must be compiled to, such as “elf”
  • args – Arguments for the compiler to use while compiling
static compile_cfile(file_in, newlocation, output_type, args)[source]

Compiles the specified C file with Clang, using the specified args. Stores this file in the specified location and returns the location as a string. If this is being used to filter the input files and the C file successfully compiles, it will be recorded in the filter file.

Parameters:
  • file_in – File to compile
  • newlocation – location to save LLVM files to
  • output_type – the type that the file must be compiled to, such as “elf”
  • args – Arguments for the compiler to use while compiling
static to_assembly(file_path, newlocation)[source]

Compiles the C file given as a path to x86 assembly. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – file path to compile
  • newlocation – location to save assembly files to
Returns:

the file location which the assembly file was saved to.

Return type:

str or None

static to_elf(file_path, newlocation)[source]

Compiles the C file given as a path to elf executables. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – file path to compile
  • newlocation – location to save elf files to
static to_llvm_opt(file_path, newlocation, optlevel='')[source]

Compiles the C file given as a path to optimized LLVM IR, at the specified opt level. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
  • optlevel – the optimization level to compile at
Returns:

the file location which the optimized LLVM file was saved to.

Return type:

str or None

static to_llvm_unopt(file_path, newlocation)[source]

Compiles the C file given as a path to unoptimized LLVM IR. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
Returns:

the file location which llvm_unopt was saved to.

Return type:

str or None

static to_object_file(file_path, newlocation)[source]

Compiles the C file given as a path to an object file. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save Object files to
Returns:

the file location which the object file was saved to.

Return type:

str or None
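
The static helpers above all delegate to compile_all with a preset output type. A minimal usage sketch, assuming a text file listing the C files to compile and writable output folders (all paths below are placeholders):

    from decompy.DataGathering.ClangSubprocess import Clang

    # Assumed: a text file listing the C source files to compile, one per line.
    file_list = "filtered_files.txt"

    # Each call compiles the listed files to one output type and returns the
    # location the output was saved to, or None if compilation failed.
    llvm_path = Clang.to_llvm_unopt(file_list, "LLVM")
    asm_path = Clang.to_assembly(file_list, "assembly")
    Clang.to_elf(file_list, "elf")
    Clang.to_object_file(file_list, "Object")

    print(llvm_path, asm_path)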

decompy.DataGathering.CreateLocalData module

class decompy.DataGathering.CreateLocalData.CreateLocalData(folder='Repositories', dest_folder='RepositoriesFiltered', database_name='c_code', repo_dict={'blacklist': ['C++', 'C#', 'css'], 'language': 'C', 'per_page': 100, 'search': 'C '}, repo_json_name='offlineResults.json', repo_json_filtered_name='filteredOfflineResults.json', filtered_repos=None, save_json='repo.json', config_file='config.json', repo_start_date=None, repo_end_date=None, verbose=True)[source]

Bases: object

Gathers the data and prepares it for use in stages. This utilizes the following modules: RepoFilter, GitHubScraper, FilterC, and ClangSubprocess. Combined, these gather all of the relevant data.

all_stages_increment(start_date=None, end_date=None, start_page=1, end_page=3)[source]

Runs all five stages in increments.

Parameters:
  • start_date (str) – date to start from or pick back up at, formatted "%Y-%m-%d"
  • end_date (str) – date to end on, formatted "%Y-%m-%d"
  • start_page (int) – page to start or pick up from
  • end_page (int) – page to end on
Returns:
void

static change_stored_directory(folder, file_path)[source]

Changes the stored directory in the repo.json to the specified folder name. This is especially useful if you are changing the name from the recommended "Repositories" to another name.

Parameters:
  • folder – the new folder to look for
  • file_path – the file path to change
Returns:
the new file path
Return type:
str

stage1_gather_repo_meta(date, start_page, end_page)[source]

Stage 1 of the data gathering process: gather the data from the repos and store it in a JSON file.

Parameters:
  • date (str) – the date to read from
  • start_page (int) – the page to start gathering data from, defaults to 1
  • end_page (int) – the page to end on, defaults to 2
Returns:
void

stage2_get_repos(test=False, username=None, password=None)[source]

Stage 2 of the data gathering process: scrape all the files from GitHub listed in the given offline JSON file.

Parameters:
  • test (bool) – whether or not this is a test run
  • username (str) – the GitHub username, used to download more data
  • password (str) – the GitHub user's password
Returns:
void

stage3_filter_files(unfiltered_key='Unfiltered')[source]

Stage 3 of the data gathering process: filter the C files, keeping the good ones from the list provided, and record them in JSON format. Currently uses default params.

Parameters:unfiltered_key – the directory to search through
Type:str
Returns:void
stage4_generate_llvm(folder=None, llvm_file_path='LLVM', object_file_path='Object', elf_file_path='elf', assembly_file_path='assembly')[source]

Stage 4 of the data gathering process: generate LLVM and other data. Gets the file paths for the LLVM and Object outputs, which default to /LLVM and /Object.

Parameters:
  • folder (str) – the file path of the folder to compile.
  • llvm_file_path (str) – the file path to save LLVM files to.
  • object_file_path (str) – the file path to save Object files to.
  • elf_file_path (str) – the file path for the elf files, defaults to “elf”.
  • assembly_file_path (str) – the file path for the assembly files, defaults to “assembly”.
Returns:
void

stage5_insert_database(folder=None)[source]

Stage 5 of the data gathering process: load into the database, reading the meta and other info. Additionally, this generates cleaned C code to train on.

Parameters:folder – folder to iterate through and insert into the database
Type:str
Returns:void
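
A sketch of the staged pipeline described above, assuming default constructor arguments; the dates, page numbers, and credentials are placeholders:

    from decompy.DataGathering.CreateLocalData import CreateLocalData

    gatherer = CreateLocalData(folder="Repositories", dest_folder="RepositoriesFiltered")

    # Run the stages one at a time...
    gatherer.stage1_gather_repo_meta("2018-06-01", 1, 2)             # repo metadata -> JSON
    gatherer.stage2_get_repos(username="user", password="password")  # placeholder credentials
    gatherer.stage3_filter_files()
    gatherer.stage4_generate_llvm()
    gatherer.stage5_insert_database()

    # ...or run all five stages incrementally over a date range.
    gatherer.all_stages_increment(start_date="2018-06-01", end_date="2018-06-30",
                                  start_page=1, end_page=3)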

decompy.DataGathering.FilterC module

class decompy.DataGathering.FilterC.FilterC[source]

Bases: object

Filters C files to our standards. This includes C header files that we find appropriate for machine learning. Enforces the maximum number of bytes we would like in a file; as of now, this is 8000 bytes. Filters out words that may be too difficult: malloc, FILE, and threading. DOES NOT CHECK IF THE FILE SUCCESSFULLY COMPILES. This is handled by ClangSubprocess, which generates a new file with file paths to use.

C_BLACKLIST = ('file', 'malloc', 'realloc', 'calloc', 'free')
C_WHITELIST_HEADERS = ('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype')
FILE_TYPE = '.c'
MAX_BYTES = 8000
MIN_BYTES = 21
static check_blacklisted_words(line, blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Lowercases the line to evaluate, and returns false if any blacklisted word is found.

Parameters:
  • line (str) – the string from a file
  • blacklisted (tuple of str) – the blacklisted words
Returns:bool
static check_byte_size(file, preferred_max_size=8000, preferred_min_size=21)[source]

Finds the file size and tests it against the preferred sizes in bytes. The default maximum is 8000 bytes.

Parameters:
  • file (str) – the file path to test against
  • preferred_max_size (int) – the preferred maximum size to allow, defaults to 8000.
  • preferred_min_size (int) – the preferred minimum size to allow, defaults to 21 (for int main(){return 0;})
Returns:
bool

static check_headers(line, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'))[source]

Uses a regex to evaluate the line, ignoring the case, and returns false if any unknown header is found.

Parameters:
  • line (str) – the string from a file
  • whitelisted (tuple of str) – the whitelisted headers, defaults to the standard C headers
Returns:bool
static check_valid_data(file, preferred_max_size=8000, preferred_min_size=21, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'), blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Runs validation testing on a given file. This includes the correct byte size, predetermined whitelisted headers, and predetermined blacklisted words.

Parameters:
  • file (str) – the file the user wants to validate.
  • preferred_max_size (int) – the max byte size the user wants
  • preferred_min_size (int) – the min byte size the user wants
  • whitelisted (tuple or list) – the whitelisted headers to search for.
  • blacklisted (tuple or list) – the blacklisted words to exclude.
Returns:
bool

static check_valid_folder(folder, filt_path_name='Unfiltered', preferred_max_size=8000, preferred_min_size=21, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'), blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Runs check_valid_data for each file in the folder path.

Parameters:
  • folder (str) – the folder whose C files the user wants to validate.
  • filt_path_name (str) – the filtered path name the user is using to store data once filtered.
  • preferred_max_size (int) – the max byte size the user wants.
  • preferred_min_size (int) – the min byte size the user wants.
  • whitelisted (tuple or list) – the whitelisted headers the user wants.
  • blacklisted (tuple or list) – the blacklisted words the user wants to exclude.
Returns:
a list of filtered file paths
Return type:
list

script_dir = '/Users/Josh/Documents/DecomPy/docs'
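
A short sketch of the filter with its defaults; the file and folder paths are placeholders:

    from decompy.DataGathering.FilterC import FilterC

    # Validate one C file against the default byte limits, header whitelist,
    # and word blacklist.
    if FilterC.check_valid_data("example/main.c"):
        print("main.c passed the filter")

    # Validate every C file under a folder and collect the paths that pass.
    passing = FilterC.check_valid_folder("Repositories/some_repo")
    print(len(passing), "files passed")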

decompy.DataGathering.RepoFilter module

class decompy.DataGathering.RepoFilter.RepoFilter(search, language=None, blacklist=None, per_page=100, username=None, password=None)[source]

Bases: object

First draft of the RepoFilter using the GitHub API. This class searches for repositories on GitHub matching a search. It was written with offline queries in mind, as I did not have access to the internet most of the time while writing it. This may still be useful for backups and redundancy, however, and it provides the ability to save a list so that another query does not have to be made to GitHub and so that results will not change as repositories change. Using too old a list may be problematic, however, as the content will no longer match what is being filtered on.

filter_repo(repo)[source]

Determine if the given repo matches our desired criteria. Uses GitHub info beyond what a simple search query is able to filter on.

Parameters:repo – The json from the GitHub repo to filter
Type:json
get_filtered_list(unfiltered_repos)[source]

Filters a list of repositories

Parameters:unfiltered_repos – The list of Unfiltered repositories
Type:list
get_results(date, page)[source]

Makes a single request to the GitHub API for a page of results matching the search criteria.

Parameters:
  • date (str) – the date to search within; the GitHub API caps search results at 1000, so queries are partitioned by date.
  • page (int) – which page of results should be fetched.
Returns:
void

offline_filtered_list(filename, unfiltered_repos)[source]

Filter a list of repositories and save it to JSON for persistent usage

Parameters:
  • filename (str) – the name to save the file as
  • unfiltered_repos (list) – the list of repos to filter

static offline_read_json(filename)[source]

Read in a json file

Parameters:filename – The filename to read
Type:str
offline_results(filename, date, start_page, end_page)[source]

Saves the list of all repos for offline usage.

Parameters:
  • filename (str) – the name that will be given to the generated file
  • date (str) – the date to search for
  • start_page (int) – the index of the first page that should be saved
  • end_page (int) – the index of the last page that should be saved
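
A sketch of the offline workflow: save raw search results once, then filter them later without querying GitHub again. The filenames, date, and search settings are placeholders:

    from decompy.DataGathering.RepoFilter import RepoFilter

    rf = RepoFilter(search="C ", language="C", blacklist=["C++", "C#", "css"], per_page=100)

    # Save pages 1-3 of raw results for one date to disk.
    rf.offline_results("offlineResults.json", "2018-06-01", 1, 3)

    # Later: read the saved results back and filter them offline.
    unfiltered = RepoFilter.offline_read_json("offlineResults.json")
    rf.offline_filtered_list("filteredOfflineResults.json", unfiltered)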

decompy.DataGathering.RepoStructure module

class decompy.DataGathering.RepoStructure.RepoStructure(repo_path='Repositories', parent_dir='.')[source]

Bases: object

batch_format(repos_json, filter_date)[source]

Format a batch of repos from a JSON list.

Parameters:
  • repos_json – a list of repos from the (legacy) GitHub API
  • filter_date – the date that the files were filtered

format_repo(repo_json, filter_date)[source]

Format a single repo from the GitHub API.

Parameters:
  • repo_json – the JSON from a single GitHub repo
  • filter_date – the filter date
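
A sketch of formatting previously saved repo metadata into the local directory structure; the JSON filename and date are placeholders:

    import json

    from decompy.DataGathering.RepoStructure import RepoStructure

    structure = RepoStructure(repo_path="Repositories", parent_dir=".")

    # Load a saved list of repos (e.g. produced by RepoFilter) and format the batch.
    with open("filteredOfflineResults.json") as fh:
        repos = json.load(fh)

    structure.batch_format(repos, filter_date="2018-06-01")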

decompy.DataGathering.WebNavigator module

class decompy.DataGathering.WebNavigator.WebNavigator[source]

Bases: object

Defines methods for navigating web links

DEBUG = False
TIMER = 0
TIMING = False
static getAbsolute(ResolvedParent, RelativeLinks)[source]

Creates absolute URLs from a set of links and their parent

Parameters:
  • ResolvedParent (str) – Any resolved parent URL
  • RelativeLinks (set of strings) – A set containing relative URLs
Returns:

The absolute URLs of the relative URLs

Return type:

set of str

static getAbsoluteLinksFromPage(link, domain=None)[source]

Combines the above methods into a single explore method

Parameters:
  • link – the absolute link to resolve
  • domain – the domain of the above link to stay within (default is no domain limiting)
Returns:

set of absolute URLs within a page

static getContent(link)[source]

Retrieves the content from a link.

Parameters:link – an absolute URL
Returns:page content
Return type:str

Finds all links contained on a page from a link

Parameters:content (str) – HTML content of a page
Returns:a set of links
Return type:set of str
static getVisibleTextContent(link)[source]

Retrieves only the visible text from a link (no tags, etc.)

Parameters:link – An absolute URL
Returns:list of visible text
static limitDomain(absoluteLinks, domain)[source]

Prunes all links outside of a given domain

Parameters:
  • absoluteLinks (set of str) – a set of absolute links
  • domain (str) – the domain used to filter the links; should be of the form example.com (not www.example.com or https://www.example.com)
Returns:
a filtered set of links
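
A sketch tying the navigation helpers together; the URL and domain are placeholders:

    from decompy.DataGathering.WebNavigator import WebNavigator

    url = "https://example.com/index.html"

    # All absolute links on the page, limited to the example.com domain.
    links = WebNavigator.getAbsoluteLinksFromPage(url, domain="example.com")

    # The raw page content and its visible text, plus an explicit domain filter.
    content = WebNavigator.getContent(url)
    visible = WebNavigator.getVisibleTextContent(url)
    same_domain = WebNavigator.limitDomain(links, "example.com")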

Module contents

class decompy.DataGathering.FileGetter[source]

Bases: object

Handles downloading GitHub repositories and extracting the useful files

static download_all_files(repo_urls, target_directories=None, username=None, password=None)[source]

Handles the downloading of ZIP archives and extracting the appropriate files into the target directory.

Parameters:
  • repo_urls – the list of URLs to repositories; URLs must point to the top level of the repositories
  • target_directories – the directory to store the files in
  • username (str) – the GitHub username, used to download more data
  • password (str) – the GitHub user's password
Returns:
Nothing
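
A sketch of downloading repositories, assuming placeholder URLs, a placeholder target directory (passed as a single string, per the documented type), and placeholder credentials:

    from decompy.DataGathering import FileGetter

    # Placeholder repository URLs; each must point to the top level of a repository.
    repo_urls = [
        "https://github.com/example-user/example-repo",
        "https://github.com/another-user/another-repo",
    ]

    # Credentials are optional placeholders used to raise GitHub's rate limit.
    FileGetter.download_all_files(repo_urls,
                                  target_directories="Repositories",
                                  username="user",
                                  password="password")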

class decompy.DataGathering.RepoFilter(search, language=None, blacklist=None, per_page=100, username=None, password=None)[source]

Bases: object

First draft of the RepoFilter using the GitHub API. This class searches for repositories on GitHub matching a search. It was written with offline queries in mind, as I did not have access to the internet most of the time while writing it. This may still be useful for backups and redundancy, however, and it provides the ability to save a list so that another query does not have to be made to GitHub and so that results will not change as repositories change. Using too old a list may be problematic, however, as the content will no longer match what is being filtered on.

filter_repo(repo)[source]

Determine if the given repo matches our desired criteria. Uses GitHub info beyond what a simple search query is able to filter on.

Parameters:repo – The json from the GitHub repo to filter
Type:json
get_filtered_list(unfiltered_repos)[source]

Filters a list of repositories

Parameters:unfiltered_repos – The list of Unfiltered repositories
Type:list
get_results(date, page)[source]

Makes a single request to the GitHub API for a page of results matching the search criteria.

Parameters:
  • date (str) – the date to search within; the GitHub API caps search results at 1000, so queries are partitioned by date.
  • page (int) – which page of results should be fetched.
Returns:
void

offline_filtered_list(filename, unfiltered_repos)[source]

Filter a list of repositories and save it to JSON for persistent usage

Parameters:
  • filename (str) – the name to save the file as
  • unfiltered_repos (list) – the list of repos to filter

static offline_read_json(filename)[source]

Read in a json file

Parameters:filename – The filename to read
Type:str
offline_results(filename, date, start_page, end_page)[source]

Saves the list of all repos for offline usage.

Parameters:
  • filename (str) – the name that will be given to the generated file
  • date (str) – the date to search for
  • start_page (int) – the index of the first page that should be saved
  • end_page (int) – the index of the last page that should be saved

class decompy.DataGathering.RepoStructure(repo_path='Repositories', parent_dir='.')[source]

Bases: object

batch_format(repos_json, filter_date)[source]

Format a batch of repos from a JSON list.

Parameters:
  • repos_json – a list of repos from the (legacy) GitHub API
  • filter_date – the date that the files were filtered

format_repo(repo_json, filter_date)[source]

Format a single repo from the GitHub API.

Parameters:
  • repo_json – the JSON from a single GitHub repo
  • filter_date – the filter date

class decompy.DataGathering.Clang[source]

Bases: object

Class to define functions for calling the Clang compiler

static compile_all(file_path, newlocation, out_type, args='')[source]

Compiles the C files listed in the given file path with Clang, using the specified args. Writes each result by calling compile_cfile, then returns the specified location of the file path. If this is being used to filter the input files, the C files that successfully compile will be recorded in the filter file.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
  • out_type – the type that the file must be compiled to, such as “elf”
  • args – Arguments for the compiler to use while compiling
static compile_cfile(file_in, newlocation, output_type, args)[source]

Compiles the specified C file with Clang, using the specified args. Stores this file in the specified location and returns the location as a string. If this is being used to filter the input files and the C file successfully compiles, it will be recorded in the filter file.

Parameters:
  • file_in – File to compile
  • newlocation – location to save LLVM files to
  • output_type – the type that the file must be compiled to, such as “elf”
  • args – Arguments for the compiler to use while compiling
static to_assembly(file_path, newlocation)[source]

Compiles the C file given as a path to x86 assembly. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – file path to compile
  • newlocation – location to save assembly files to
Returns:

the file location which the assembly file was saved to.

Return type:

str or None

static to_elf(file_path, newlocation)[source]

Compiles the C file given as a path to elf executables. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – file path to compile
  • newlocation – location to save elf files to
static to_llvm_opt(file_path, newlocation, optlevel='')[source]

Compiles the C file given as a path to optimized LLVM IR, at the specified opt level. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
  • optlevel – the optimization level to compile at
Returns:

the file location which the optimized LLVM file was saved to.

Return type:

str or None

static to_llvm_unopt(file_path, newlocation)[source]

Compiles the C file given as a path to unoptimized LLVM IR. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
Returns:

the file location which llvm_unopt was saved to.

Return type:

str or None

static to_object_file(file_path, newlocation)[source]

Compiles the C file given as a path to an object file. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save Object files to
Returns:

the file location which the object file was saved to.

Return type:

str or None

class decompy.DataGathering.FilterC[source]

Bases: object

Filters C files to our standards. This includes C header files that we find appropriate for machine learning. Enforces the maximum number of bytes we would like in a file; as of now, this is 8000 bytes. Filters out words that may be too difficult: malloc, FILE, and threading. DOES NOT CHECK IF THE FILE SUCCESSFULLY COMPILES. This is handled by ClangSubprocess, which generates a new file with file paths to use.

C_BLACKLIST = ('file', 'malloc', 'realloc', 'calloc', 'free')
C_WHITELIST_HEADERS = ('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype')
FILE_TYPE = '.c'
MAX_BYTES = 8000
MIN_BYTES = 21
static check_blacklisted_words(line, blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Lowercases the line to evaluate, and returns false if any blacklisted word is found.

Parameters:
  • line (str) – the string from a file
  • blacklisted (tuple of str) – the blacklisted words
Returns:bool
static check_byte_size(file, preferred_max_size=8000, preferred_min_size=21)[source]

Finds the file size and tests it against the preferred sizes in bytes. The default maximum is 8000 bytes.

Parameters:
  • file (str) – the file path to test against
  • preferred_max_size (int) – the preferred maximum size to allow, defaults to 8000.
  • preferred_min_size (int) – the preferred minimum size to allow, defaults to 21 (for int main(){return 0;})
Returns:
bool

static check_headers(line, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'))[source]

Uses a regex to evaluate the line, ignoring the case, and returns false if any unknown header is found.

Parameters:
  • line (str) – the string from a file
  • whitelisted (tuple of str) – the whitelisted headers, defaults to the standard C headers
Returns:bool
static check_valid_data(file, preferred_max_size=8000, preferred_min_size=21, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'), blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Runs validation testing on a given file. This includes the correct byte size, predetermined whitelisted headers, and predetermined blacklisted words.

Parameters:
  • file (str) – the file the user wants to validate.
  • preferred_max_size (int) – the max byte size the user wants
  • preferred_min_size (int) – the min byte size the user wants
  • whitelisted (tuple or list) – the whitelisted headers to search for.
  • blacklisted (tuple or list) – the blacklisted words to exclude.
Returns:
bool

static check_valid_folder(folder, filt_path_name='Unfiltered', preferred_max_size=8000, preferred_min_size=21, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'), blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Runs check_valid_data for each file in the folder path.

Parameters:
  • folder (str) – the folder whose C files the user wants to validate.
  • filt_path_name (str) – the filtered path name the user is using to store data once filtered.
  • preferred_max_size (int) – the max byte size the user wants.
  • preferred_min_size (int) – the min byte size the user wants.
  • whitelisted (tuple or list) – the whitelisted headers the user wants.
  • blacklisted (tuple or list) – the blacklisted words the user wants to exclude.
Returns:
a list of filtered file paths
Return type:
list

script_dir = '/Users/Josh/Documents/DecomPy/docs'