decompy.DataGathering package

Submodules

decompy.DataGathering.ClangSubprocess module

class decompy.DataGathering.ClangSubprocess.Clang[source]

Bases: object

Class to define functions for calling the Clang compiler

static compile_all(file_path, newlocation, out_type, args='')[source]

Compiles the C files listed in the given file path with Clang, using the specified args. Writes each result by calling compile_cfile, then returns the specified location of the file path. If this is being used to filter the input files, the C files that successfully compile will be recorded in the filter file.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
  • out_type – the type that the file must be compiled to, such as “elf”
  • args – Arguments for the compiler to use while compiling
static compile_cfile(file_in, newlocation, output_type, args)[source]

Compiles the specified C file with Clang, using the specified args. Stores this file in the specified location and returns the location as a string. If this is being used to filter the input files and the C file successfully compiles, it will be recorded in the filter file.

Parameters:
  • file_in – File to compile
  • newlocation – location to save LLVM files to
  • output_type – the type that the file must be compiled to, such as “elf”
  • args – Arguments for the compiler to use while compiling
static to_assembly(file_path, newlocation)[source]

Compiles the C file given as a path to x86 assembly. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – file path to compile
  • newlocation – location to save assembly files to
Returns:

the file location which the assembly file was saved to.

Return type:

str or None

static to_elf(file_path, newlocation)[source]

Compiles the C file given as a path to elf executables. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – file path to compile
  • newlocation – location to save elf files to
static to_llvm_opt(file_path, newlocation, optlevel='')[source]

Compiles the C file given as a path to optimized LLVM IR, at the specified opt level. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
  • optlevel – the optimization level to compile at
Returns:

the file location which the optimized LLVM file was saved to.

Return type:

str or None

static to_llvm_unopt(file_path, newlocation)[source]

Compiles the C file given as a path to unoptimized LLVM IR. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
Returns:

the file location which llvm_unopt was saved to.

Return type:

str or None

static to_object_file(file_path, newlocation)[source]

Compiles the C file given as a path to an object file. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save Object files to
Returns:

the file location which the object file was saved to.

Return type:

str or None
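
The static helpers above all delegate to compile_all with a preset output type. A minimal usage sketch, assuming a text file listing the C files to compile and writable output folders (all paths below are placeholders):

    from decompy.DataGathering.ClangSubprocess import Clang

    # Assumed: a text file listing the C source files to compile, one per line.
    file_list = "filtered_files.txt"

    # Each call compiles the listed files to one output type and returns the
    # location the output was saved to, or None if compilation failed.
    llvm_path = Clang.to_llvm_unopt(file_list, "LLVM")
    asm_path = Clang.to_assembly(file_list, "assembly")
    Clang.to_elf(file_list, "elf")
    Clang.to_object_file(file_list, "Object")

    print(llvm_path, asm_path)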

decompy.DataGathering.CreateLocalData module

class decompy.DataGathering.CreateLocalData.CreateLocalData(folder='Repositories', dest_folder='RepositoriesFiltered', database_name='c_code', repo_dict={'blacklist': ['C++', 'C#', 'css'], 'language': 'C', 'per_page': 100, 'search': 'C '}, repo_json_name='offlineResults.json', repo_json_filtered_name='filteredOfflineResults.json', filtered_repos=None, save_json='repo.json', config_file='config.json', repo_start_date=None, repo_end_date=None, verbose=True)[source]

Bases: object

Gathers the data and prepares it for use in stages. This utilizes the following modules: RepoFilter, GitHubScraper, FilterC, and ClangSubprocess. Combined, these gather all of the relevant data.

all_stages_increment(start_date=None, end_date=None, start_page=1, end_page=3)[source]

Runs all five stages in increments.

Parameters:
  • start_date (str) – date to start from or pick back up at, formatted "%Y-%m-%d"
  • end_date (str) – date to end on, formatted "%Y-%m-%d"
  • start_page (int) – page to start or pick up from
  • end_page (int) – page to end on
Returns:
void

static change_stored_directory(folder, file_path)[source]

Changes the stored directory in the repo.json to the specified folder name. This is especially useful if you are changing the name from the recommended "Repositories" to another name.

Parameters:
  • folder – the new folder to look for
  • file_path – the file path to change
Returns:
the new file path
Return type:
str

stage1_gather_repo_meta(date, start_page, end_page)[source]

Stage 1 of the data gathering process: gather the data from the repos and store it in a JSON file.

Parameters:
  • date (str) – the date to read from
  • start_page (int) – the page to start gathering data from, defaults to 1
  • end_page (int) – the page to end on, defaults to 2
Returns:
void

stage2_get_repos(test=False, username=None, password=None)[source]

Stage 2 of the data gathering process: scrape all the files from GitHub listed in the given offline JSON file.

Parameters:
  • test (bool) – whether or not this is a test run
  • username (str) – the GitHub username, used to download more data
  • password (str) – the GitHub user's password
Returns:
void

stage3_filter_files(unfiltered_key='Unfiltered')[source]

Stage 3 of the data gathering process: filter the C files, keeping the good ones from the list provided, and record them in JSON format. Currently uses default params.

Parameters:unfiltered_key – the directory to search through
Type:str
Returns:void
stage4_generate_llvm(folder=None, llvm_file_path='LLVM', object_file_path='Object', elf_file_path='elf', assembly_file_path='assembly')[source]

Stage 4 of the data gathering process: generate LLVM and other data. Gets the file paths for the LLVM and Object outputs, which default to /LLVM and /Object.

Parameters:
  • folder (str) – the file path of the folder to compile.
  • llvm_file_path (str) – the file path to save LLVM files to.
  • object_file_path (str) – the file path to save Object files to.
  • elf_file_path (str) – the file path for the elf files, defaults to “elf”.
  • assembly_file_path (str) – the file path for the assembly files, defaults to “assembly”.
Returns:
void

stage5_insert_database(folder=None)[source]

Stage 5 of the data gathering process: load into the database, reading the meta and other info. Additionally, this generates cleaned C code to train on.

Parameters:folder – folder to iterate through and insert into the database
Type:str
Returns:void
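
A sketch of the staged pipeline described above, assuming default constructor arguments; the dates, page numbers, and credentials are placeholders:

    from decompy.DataGathering.CreateLocalData import CreateLocalData

    gatherer = CreateLocalData(folder="Repositories", dest_folder="RepositoriesFiltered")

    # Run the stages one at a time...
    gatherer.stage1_gather_repo_meta("2018-06-01", 1, 2)             # repo metadata -> JSON
    gatherer.stage2_get_repos(username="user", password="password")  # placeholder credentials
    gatherer.stage3_filter_files()
    gatherer.stage4_generate_llvm()
    gatherer.stage5_insert_database()

    # ...or run all five stages incrementally over a date range.
    gatherer.all_stages_increment(start_date="2018-06-01", end_date="2018-06-30",
                                  start_page=1, end_page=3)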

decompy.DataGathering.FilterC module

class decompy.DataGathering.FilterC.FilterC[source]

Bases: object

Filters C files to our standards. This includes C header files that we find appropriate for machine learning. Enforces the maximum number of bytes we would like in a file; as of now, this is 8000 bytes. Filters out words that may be too difficult: malloc, FILE, and threading. DOES NOT CHECK IF THE FILE SUCCESSFULLY COMPILES. This is handled by ClangSubprocess, which generates a new file with file paths to use.

C_BLACKLIST = ('file', 'malloc', 'realloc', 'calloc', 'free')
C_WHITELIST_HEADERS = ('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype')
FILE_TYPE = '.c'
MAX_BYTES = 8000
MIN_BYTES = 21
static check_blacklisted_words(line, blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Lowercases the line to evaluate, and returns false if any blacklisted word is found.

Parameters:
  • line (str) – the string from a file
  • blacklisted (tuple of str) – the blacklisted words
Returns:bool
static check_byte_size(file, preferred_max_size=8000, preferred_min_size=21)[source]

Finds the file size and tests it against the preferred sizes in bytes. The default maximum is 8000 bytes.

Parameters:
  • file (str) – the file path to test against
  • preferred_max_size (int) – the preferred maximum size to allow, defaults to 8000.
  • preferred_min_size (int) – the preferred minimum size to allow, defaults to 21 (for int main(){return 0;})
Returns:
bool

static check_headers(line, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'))[source]

Uses a regex to evaluate the line, ignoring the case, and returns false if any unknown header is found.

Parameters:
  • line (str) – the string from a file
  • whitelisted (tuple of str) – the whitelisted headers, defaults to the standard C headers
Returns:bool
static check_valid_data(file, preferred_max_size=8000, preferred_min_size=21, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'), blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Runs validation testing on a given file. This includes the correct byte size, predetermined whitelisted headers, and predetermined blacklisted words.

Parameters:
  • file (str) – the file the user wants to validate.
  • preferred_max_size (int) – the max byte size the user wants
  • preferred_min_size (int) – the min byte size the user wants
  • whitelisted (tuple or list) – the whitelisted headers to search for.
  • blacklisted (tuple or list) – the blacklisted words to exclude.
Returns:
bool

static check_valid_folder(folder, filt_path_name='Unfiltered', preferred_max_size=8000, preferred_min_size=21, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'), blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Runs check_valid_data for each file in the folder path.

Parameters:
  • folder (str) – the folder whose C files the user wants to validate.
  • filt_path_name (str) – the filtered path name the user is using to store data once filtered.
  • preferred_max_size (int) – the max byte size the user wants.
  • preferred_min_size (int) – the min byte size the user wants.
  • whitelisted (tuple or list) – the whitelisted headers the user wants.
  • blacklisted (tuple or list) – the blacklisted words the user wants to exclude.
Returns:
a list of filtered file paths
Return type:
list

script_dir = '/Users/Josh/Documents/DecomPy/docs'
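
A short sketch of the filter with its defaults; the file and folder paths are placeholders:

    from decompy.DataGathering.FilterC import FilterC

    # Validate one C file against the default byte limits, header whitelist,
    # and word blacklist.
    if FilterC.check_valid_data("example/main.c"):
        print("main.c passed the filter")

    # Validate every C file under a folder and collect the paths that pass.
    passing = FilterC.check_valid_folder("Repositories/some_repo")
    print(len(passing), "files passed")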

decompy.DataGathering.RepoFilter module

class decompy.DataGathering.RepoFilter.RepoFilter(search, language=None, blacklist=None, per_page=100, username=None, password=None)[source]

Bases: object

First draft of the RepoFilter using the GitHub API. This class searches for repositories on GitHub matching a search. It was written with offline queries in mind, as I did not have access to the internet most of the time while writing it. This may still be useful for backups and redundancy, however, and it provides the ability to save a list so that another query does not have to be made to GitHub and so that results will not change as repositories change. Using too old a list may be problematic, however, as the content will no longer match what is being filtered on.

filter_repo(repo)[source]

Determine if the given repo matches our desired criteria. Uses GitHub info beyond what a simple search query is able to filter on.

Parameters:repo – The json from the GitHub repo to filter
Type:json
get_filtered_list(unfiltered_repos)[source]

Filters a list of repositories

Parameters:unfiltered_repos – The list of Unfiltered repositories
Type:list
get_results(date, page)[source]

Makes a single request to the GitHub API for a page of results matching the search criteria.

Parameters:
  • date (str) – the date to search within; the GitHub API caps search results at 1000, so queries are partitioned by date.
  • page (int) – which page of results should be fetched.
Returns:
void

offline_filtered_list(filename, unfiltered_repos)[source]

Filter a list of repositories and save it to JSON for persistent usage

Parameters:
  • filename (str) – the name to save the file as
  • unfiltered_repos (list) – the list of repos to filter

static offline_read_json(filename)[source]

Read in a json file

Parameters:filename – The filename to read
Type:str
offline_results(filename, date, start_page, end_page)[source]

Saves the list of all repos for offline usage.

Parameters:
  • filename (str) – the name that will be given to the generated file
  • date (str) – the date to search for
  • start_page (int) – the index of the first page that should be saved
  • end_page (int) – the index of the last page that should be saved
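
A sketch of the offline workflow: save raw search results once, then filter them later without querying GitHub again. The filenames, date, and search settings are placeholders:

    from decompy.DataGathering.RepoFilter import RepoFilter

    rf = RepoFilter(search="C ", language="C", blacklist=["C++", "C#", "css"], per_page=100)

    # Save pages 1-3 of raw results for one date to disk.
    rf.offline_results("offlineResults.json", "2018-06-01", 1, 3)

    # Later: read the saved results back and filter them offline.
    unfiltered = RepoFilter.offline_read_json("offlineResults.json")
    rf.offline_filtered_list("filteredOfflineResults.json", unfiltered)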

decompy.DataGathering.RepoStructure module

class decompy.DataGathering.RepoStructure.RepoStructure(repo_path='Repositories', parent_dir='.')[source]

Bases: object

batch_format(repos_json, filter_date)[source]

Format a batch of repos from a JSON list.

Parameters:
  • repos_json – a list of repos from the (legacy) GitHub API
  • filter_date – the date that the files were filtered

format_repo(repo_json, filter_date)[source]

Format a single repo from the GitHub API.

Parameters:
  • repo_json – the JSON from a single GitHub repo
  • filter_date – the filter date
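
A sketch of formatting previously saved repo metadata into the local directory structure; the JSON filename and date are placeholders:

    import json

    from decompy.DataGathering.RepoStructure import RepoStructure

    structure = RepoStructure(repo_path="Repositories", parent_dir=".")

    # Load a saved list of repos (e.g. produced by RepoFilter) and format the batch.
    with open("filteredOfflineResults.json") as fh:
        repos = json.load(fh)

    structure.batch_format(repos, filter_date="2018-06-01")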

decompy.DataGathering.WebNavigator module

class decompy.DataGathering.WebNavigator.WebNavigator[source]

Bases: object

Defines methods for navigating web links

DEBUG = False
TIMER = 0
TIMING = False
static getAbsolute(ResolvedParent, RelativeLinks)[source]

Creates absolute URLs from a set of links and their parent

Parameters:
  • ResolvedParent (str) – Any resolved parent URL
  • RelativeLinks (set of strings) – A set containing relative URLs
Returns:

The absolute URLs of the relative URLs

Return type:

set of str

static getAbsoluteLinksFromPage(link, domain=None)[source]

Combines the above methods into a single explore method

Parameters:
  • link – the absolute link to resolve
  • domain – the domain of the above link to stay within (default is no domain limiting)
Returns:

set of absolute URLs within a page

static getContent(link)[source]

Retrieves the content from a link.

Parameters:link – an absolute URL
Returns:page content
Return type:str

Finds all links contained on a page from a link

Parameters:content (str) – HTML content of a page
Returns:a set of links
Return type:set of str
static getVisibleTextContent(link)[source]

Retrieves only the visible text from a link (no tags, etc.)

Parameters:link – An absolute URL
Returns:list of visible text
static limitDomain(absoluteLinks, domain)[source]

Prunes all links outside of a given domain

Parameters:
  • absoluteLinks (set of str) – a set of absolute links
  • domain (str) – the domain used to filter the links; should be of the form example.com (not www.example.com or https://www.example.com)
Returns:
a filtered set of links
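
A sketch tying the navigation helpers together; the URL and domain are placeholders:

    from decompy.DataGathering.WebNavigator import WebNavigator

    url = "https://example.com/index.html"

    # All absolute links on the page, limited to the example.com domain.
    links = WebNavigator.getAbsoluteLinksFromPage(url, domain="example.com")

    # The raw page content and its visible text, plus an explicit domain filter.
    content = WebNavigator.getContent(url)
    visible = WebNavigator.getVisibleTextContent(url)
    same_domain = WebNavigator.limitDomain(links, "example.com")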

Module contents

class decompy.DataGathering.FileGetter[source]

Bases: object

Handles downloading GitHub repositories and extracting the useful files

static download_all_files(repo_urls, target_directories=None, username=None, password=None)[source]

Handles the downloading of ZIP archives and extracting the appropriate files into the target directory.

Parameters:
  • repo_urls – the list of URLs to repositories; URLs must point to the top level of the repositories
  • target_directories – the directory to store the files in
  • username (str) – the GitHub username, used to download more data
  • password (str) – the GitHub user's password
Returns:
Nothing
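
A sketch of downloading repositories, assuming placeholder URLs, a placeholder target directory (passed as a single string, per the documented type), and placeholder credentials:

    from decompy.DataGathering import FileGetter

    # Placeholder repository URLs; each must point to the top level of a repository.
    repo_urls = [
        "https://github.com/example-user/example-repo",
        "https://github.com/another-user/another-repo",
    ]

    # Credentials are optional placeholders used to raise GitHub's rate limit.
    FileGetter.download_all_files(repo_urls,
                                  target_directories="Repositories",
                                  username="user",
                                  password="password")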

class decompy.DataGathering.RepoFilter(search, language=None, blacklist=None, per_page=100, username=None, password=None)[source]

Bases: object

First draft of the RepoFilter using the GitHub API. This class searches for repositories on GitHub matching a search. It was written with offline queries in mind, as I did not have access to the internet most of the time while writing it. This may still be useful for backups and redundancy, however, and it provides the ability to save a list so that another query does not have to be made to GitHub and so that results will not change as repositories change. Using too old a list may be problematic, however, as the content will no longer match what is being filtered on.

filter_repo(repo)[source]

Determine if the given repo matches our desired criteria. Uses GitHub info beyond what a simple search query is able to filter on.

Parameters:repo – The json from the GitHub repo to filter
Type:json
get_filtered_list(unfiltered_repos)[source]

Filters a list of repositories

Parameters:unfiltered_repos – The list of Unfiltered repositories
Type:list
get_results(date, page)[source]

Makes a single request to the GitHub API for a page of results matching the search criteria.

Parameters:
  • date (str) – the date to search within; the GitHub API caps search results at 1000, so queries are partitioned by date.
  • page (int) – which page of results should be fetched.
Returns:
void

offline_filtered_list(filename, unfiltered_repos)[source]

Filter a list of repositories and save it to JSON for persistent usage

Parameters:
  • filename (str) – the name to save the file as
  • unfiltered_repos (list) – the list of repos to filter

static offline_read_json(filename)[source]

Read in a json file

Parameters:filename – The filename to read
Type:str
offline_results(filename, date, start_page, end_page)[source]

Saves the list of all repos for offline usage.

Parameters:
  • filename (str) – the name that will be given to the generated file
  • date (str) – the date to search for
  • start_page (int) – the index of the first page that should be saved
  • end_page (int) – the index of the last page that should be saved

class decompy.DataGathering.RepoStructure(repo_path='Repositories', parent_dir='.')[source]

Bases: object

batch_format(repos_json, filter_date)[source]

Format a batch of repos from a JSON list.

Parameters:
  • repos_json – a list of repos from the (legacy) GitHub API
  • filter_date – the date that the files were filtered

format_repo(repo_json, filter_date)[source]

Format a single repo from the GitHub API.

Parameters:
  • repo_json – the JSON from a single GitHub repo
  • filter_date – the filter date

class decompy.DataGathering.Clang[source]

Bases: object

Class to define functions for calling the Clang compiler

static compile_all(file_path, newlocation, out_type, args='')[source]

Compiles the C files listed in the given file path with Clang, using the specified args. Writes each result by calling compile_cfile, then returns the specified location of the file path. If this is being used to filter the input files, the C files that successfully compile will be recorded in the filter file.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
  • out_type – the type that the file must be compiled to, such as “elf”
  • args – Arguments for the compiler to use while compiling
static compile_cfile(file_in, newlocation, output_type, args)[source]

Compiles the specified C file with Clang, using the specified args. Stores this file in the specified location and returns the location as a string. If this is being used to filter the input files and the C file successfully compiles, it will be recorded in the filter file.

Parameters:
  • file_in – File to compile
  • newlocation – location to save LLVM files to
  • output_type – the type that the file must be compiled to, such as “elf”
  • args – Arguments for the compiler to use while compiling
static to_assembly(file_path, newlocation)[source]

Compiles the C file given as a path to x86 assembly. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – file path to compile
  • newlocation – location to save assembly files to
Returns:

the file location which the assembly file was saved to.

Return type:

str or None

static to_elf(file_path, newlocation)[source]

Compiles the C file given as a path to elf executables. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – file path to compile
  • newlocation – location to save elf files to
static to_llvm_opt(file_path, newlocation, optlevel='')[source]

Compiles the C file given as a path to optimized LLVM IR, at the specified opt level. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
  • optlevel – the optimization level to compile at
Returns:

the file location which the optimized LLVM file was saved to.

Return type:

str or None

static to_llvm_unopt(file_path, newlocation)[source]

Compiles the C file given as a path to unoptimized LLVM IR. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save LLVM files to
Returns:

the file location which llvm_unopt was saved to.

Return type:

str or None

static to_object_file(file_path, newlocation)[source]

Compiles the C file given as a path to an object file. Writes to a file by calling compile_cfile through compile_all then returns the specified location of the file path.

Parameters:
  • file_path – File with list of C file names to compile
  • newlocation – location to save Object files to
Returns:

the file location which the object file was saved to.

Return type:

str or None

class decompy.DataGathering.FilterC[source]

Bases: object

Filters C files to our standards. This includes C header files that we find appropriate for machine learning. Enforces the maximum number of bytes we would like in a file; as of now, this is 8000 bytes. Filters out words that may be too difficult: malloc, FILE, and threading. DOES NOT CHECK IF THE FILE SUCCESSFULLY COMPILES. This is handled by ClangSubprocess, which generates a new file with file paths to use.

C_BLACKLIST = ('file', 'malloc', 'realloc', 'calloc', 'free')
C_WHITELIST_HEADERS = ('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype')
FILE_TYPE = '.c'
MAX_BYTES = 8000
MIN_BYTES = 21
static check_blacklisted_words(line, blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Lowercases the line to evaluate, and returns false if any blacklisted word is found.

Parameters:
  • line (str) – the string from a file
  • blacklisted (tuple of str) – the blacklisted words
Returns:bool
static check_byte_size(file, preferred_max_size=8000, preferred_min_size=21)[source]

Finds the file size and tests it against the preferred sizes in bytes. The default maximum is 8000 bytes.

Parameters:
  • file (str) – the file path to test against
  • preferred_max_size (int) – the preferred maximum size to allow, defaults to 8000.
  • preferred_min_size (int) – the preferred minimum size to allow, defaults to 21 (for int main(){return 0;})
Returns:
bool

static check_headers(line, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'))[source]

Uses a regex to evaluate the line, ignoring the case, and returns false if any unknown header is found.

Parameters:
  • line (str) – the string from a file
  • whitelisted (tuple of str) – the whitelisted headers, defaults to the standard C headers
Returns:bool
static check_valid_data(file, preferred_max_size=8000, preferred_min_size=21, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'), blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Runs validation testing on a given file. This includes the correct byte size, predetermined whitelisted headers, and predetermined blacklisted words.

Parameters:
  • file (str) – the file the user wants to validate.
  • preferred_max_size (int) – the max byte size the user wants
  • preferred_min_size (int) – the min byte size the user wants
  • whitelisted (tuple or list) – the whitelisted headers to search for.
  • blacklisted (tuple or list) – the blacklisted words to exclude.
Returns:
bool

static check_valid_folder(folder, filt_path_name='Unfiltered', preferred_max_size=8000, preferred_min_size=21, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'), blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))[source]

Runs check_valid_data for each file in the folder path.

Parameters:
  • folder (str) – the folder whose C files the user wants to validate.
  • filt_path_name (str) – the filtered path name the user is using to store data once filtered.
  • preferred_max_size (int) – the max byte size the user wants.
  • preferred_min_size (int) – the min byte size the user wants.
  • whitelisted (tuple or list) – the whitelisted headers the user wants.
  • blacklisted (tuple or list) – the blacklisted words the user wants to exclude.
Returns:
a list of filtered file paths
Return type:
list

script_dir = '/Users/Josh/Documents/DecomPy/docs'