decompy.DataGathering package
Submodules
decompy.DataGathering.ClangSubprocess module
-
class decompy.DataGathering.ClangSubprocess.Clang
Bases: object
Defines functions for calling the Clang compiler.
-
static compile_all(file_path, newlocation, out_type, args='')
Compiles the C files listed in the given file with Clang, using the specified args. Writes each output file by calling compile_cfile, then returns the specified location. If this is being used to filter the input files, the C files that compile successfully are entered in the filter file.
Parameters: - file_path – File with list of C file names to compile
- newlocation – location to save the compiled files to
- out_type – the type that the file must be compiled to, such as “elf”
- args – Arguments for the compiler to use while compiling
-
static compile_cfile(file_in, newlocation, output_type, args)
Compiles the specified C file with Clang, using the specified args. Stores the output in the specified location and returns that location as a string. If this is being used to filter the input file, a C file that compiles successfully is entered in the filter file.
Parameters: - file_in – File to compile
- newlocation – location to save the compiled file to
- output_type – the type that the file must be compiled to, such as “elf”
- args – Arguments for the compiler to use while compiling
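As a rough illustration of what a compile_cfile-style call might do, the sketch below builds a Clang command line for one C file and runs it. The helper names (build_clang_command, compile_one), the OUTPUT_FLAGS keys, and the output suffixes are assumptions made for this example and are not taken from decompy; only the Clang flags themselves (-S, -emit-llvm, -c, -o) are standard.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from an output type to Clang flags and a file suffix.
# The keys and suffixes are assumptions; the flags are standard Clang options.
OUTPUT_FLAGS = {
    "llvm": (["-S", "-emit-llvm"], ".ll"),  # textual LLVM IR
    "assembly": (["-S"], ".s"),             # target assembly
    "object": (["-c"], ".o"),               # object file, not linked
    "elf": ([], ".elf"),                    # compile and link an executable
}


def build_clang_command(file_in, newlocation, output_type, args=""):
    """Return (command, output_path) for compiling one C file."""
    flags, suffix = OUTPUT_FLAGS[output_type]
    out_path = Path(newlocation) / (Path(file_in).stem + suffix)
    command = ["clang", *flags, *args.split(), str(file_in), "-o", str(out_path)]
    return command, str(out_path)


def compile_one(file_in, newlocation, output_type, args=""):
    """Run Clang; return the output path on success, otherwise None."""
    command, out_path = build_clang_command(file_in, newlocation, output_type, args)
    Path(newlocation).mkdir(parents=True, exist_ok=True)
    result = subprocess.run(command, capture_output=True)
    return out_path if result.returncode == 0 else None
```

Returning None on a failed compile mirrors the "str or None" return type documented for the to_* wrappers, and is what would let a caller use compilation as a filter.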
-
static to_assembly(file_path, newlocation)
Compiles the C file given as a path to x86 assembly. Writes the output by calling compile_cfile through compile_all, then returns the location it was saved to.
Parameters: - file_path – file path to compile
- newlocation – location to save assembly files to
Returns: the file location which the assembly was saved to.
Return type: str or None
-
static to_elf(file_path, newlocation)
Compiles the C file given as a path to an ELF executable. Writes the output by calling compile_cfile through compile_all, then returns the location it was saved to.
Parameters: - file_path – file path to compile
- newlocation – location to save ELF files to
-
static to_llvm_opt(file_path, newlocation, optlevel='')
Compiles the C file given as a path to optimized LLVM IR, at the specified opt level. Writes the output by calling compile_cfile through compile_all, then returns the location it was saved to.
Parameters: - file_path – File with list of C file names to compile
- newlocation – location to save LLVM files to
- optlevel – the optimization level to compile at
Returns: the file location which the optimized LLVM IR was saved to.
Return type: str or None
-
static to_llvm_unopt(file_path, newlocation)
Compiles the C file given as a path to unoptimized LLVM IR. Writes the output by calling compile_cfile through compile_all, then returns the location it was saved to.
Parameters: - file_path – File with list of C file names to compile
- newlocation – location to save LLVM files to
Returns: the file location which the unoptimized LLVM IR was saved to.
Return type: str or None
-
static to_object_file(file_path, newlocation)
Compiles the C file given as a path to an object file. Writes the output by calling compile_cfile through compile_all, then returns the location it was saved to.
Parameters: - file_path – File with list of C file names to compile
- newlocation – location to save Object files to
Returns: the file location which the object file was saved to.
Return type: str or None
decompy.DataGathering.CreateLocalData module
-
class decompy.DataGathering.CreateLocalData.CreateLocalData(folder='Repositories', dest_folder='RepositoriesFiltered', database_name='c_code', repo_dict={'blacklist': ['C++', 'C#', 'css'], 'language': 'C', 'per_page': 100, 'search': 'C '}, repo_json_name='offlineResults.json', repo_json_filtered_name='filteredOfflineResults.json', filtered_repos=None, save_json='repo.json', config_file='config.json', repo_start_date=None, repo_end_date=None, verbose=True)
Bases: object
Gathers the data and prepares it for use, in stages. This utilizes the following files: RepoFilter, GitHubScraper, FilterC, and ClangSubprocess. Combined, these collect all the relevant data.
-
all_stages_increment(start_date=None, end_date=None, start_page=1, end_page=3)
Runs all five stages in increments.
Parameters: - start_date (str) – date to start or pick back up from, formatted “%Y-%m-%d”
- end_date (str) – date to end on, formatted “%Y-%m-%d”
- start_page (int) – page to start or pick up from.
- end_page (int) – page to end or pick up from.
Returns: void
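To make the incremental run concrete, here is a minimal sketch of how a date range in “%Y-%m-%d” format and a page range could be expanded into individual (date, page) steps. The one-day step size, the inclusive end page, and the helper names daily_dates and plan_increments are assumptions for illustration; the source does not state how all_stages_increment actually advances.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%Y-%m-%d"  # the format named in the parameter descriptions


def daily_dates(start_date, end_date):
    """Yield every date from start_date to end_date inclusive, as strings."""
    current = datetime.strptime(start_date, DATE_FORMAT).date()
    end = datetime.strptime(end_date, DATE_FORMAT).date()
    while current <= end:
        yield current.strftime(DATE_FORMAT)
        current += timedelta(days=1)  # assumed step: one day per increment


def plan_increments(start_date, end_date, start_page=1, end_page=3):
    """List the (date, page) pairs an incremental run would visit."""
    return [
        (date, page)
        for date in daily_dates(start_date, end_date)
        for page in range(start_page, end_page + 1)  # end page assumed inclusive
    ]
```

Planning the steps up front like this makes it easy to resume from a given date and page after an interruption.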
-
static change_stored_directory(folder, file_path)
Changes the stored directory in repo.json to the specified folder name. This is especially useful if you are changing the name from the recommended “Repositories” to another name.
Parameters: - folder – the new folder to look for
- file_path – the file path to change
Returns: the new file path
Return type: str
-
stage1_gather_repo_meta(date, start_page, end_page)
Stage 1 of the data gathering process: gathers the metadata from the repos and stores it in a JSON file.
Parameters: - date (str) – the date to read from
- start_page (int) – the page to start getting the data from.
- end_page (int) – the page to stop getting the data at.
Returns: void
-
stage2_get_repos(test=False, username=None, password=None)
Stage 2 of the data gathering process: scrapes all the files from GitHub for the repos in the given offline JSON file.
Parameters: - test (bool) – whether or not to test
- username (str) – the GitHub username, used to download more data.
- password (str) – the GitHub user’s password.
Returns: void
-
stage3_filter_files(unfiltered_key='Unfiltered')
Stage 3 of the data gathering process: filters the downloaded C files, keeps the ones that pass, and records them in JSON format. Currently uses the default filter parameters.
Parameters: - unfiltered_key (str) – the directory to search through
Returns: void
-
stage4_generate_llvm(folder=None, llvm_file_path='LLVM', object_file_path='Object', elf_file_path='elf', assembly_file_path='assembly')
Stage 4 of the data gathering process: generates LLVM and other compiled data. Takes the file paths to save each kind of output to, which default to “LLVM”, “Object”, “elf”, and “assembly”.
Parameters: - folder (str) – the file path of the folder to compile.
- llvm_file_path (str) – the file path to save LLVM files to, defaults to “LLVM”.
- object_file_path (str) – the file path to save Object files to, defaults to “Object”.
- elf_file_path (str) – the file path for the elf file, defaults to “elf”.
- assembly_file_path (str) – the file path for the assembly file, defaults to “assembly”.
Returns: void
-
decompy.DataGathering.FilterC module
-
class decompy.DataGathering.FilterC.FilterC
Bases: object
Filters C files to our standards. This allows only the C header files that we find appropriate for machine learning. Filters out files over the maximum number of bytes we would like in a file; as of now, this is 8000 bytes (MAX_BYTES). Filters out words that may be too difficult: malloc, FILE, and threading. Does not check whether a file compiles successfully; that is handled by ClangSubprocess, which generates a new file with the file paths to use.
-
C_BLACKLIST = ('file', 'malloc', 'realloc', 'calloc', 'free')
-
C_WHITELIST_HEADERS = ('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype')
-
FILE_TYPE = '.c'
-
MAX_BYTES = 8000
-
MIN_BYTES = 21
-
static check_blacklisted_words(line, blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))
Lowercases the line to evaluate, and returns False if any blacklisted word is found.
Parameters: - line (str) – the string from a file.
- blacklisted (array of str) – the blacklisted words.
Returns: boolean
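The check described above can be sketched in a few lines. This is an illustrative version only: the function name line_is_clean is hypothetical, and the use of plain substring matching is an assumption, since the source does not say how a "word" is matched.

```python
# Default blacklist copied from the documented C_BLACKLIST value.
C_BLACKLIST = ("file", "malloc", "realloc", "calloc", "free")


def line_is_clean(line, blacklisted=C_BLACKLIST):
    """Return False if the lowercased line contains any blacklisted word."""
    lowered = line.lower()
    # Assumed matching rule: simple substring search.
    return not any(word in lowered for word in blacklisted)
```

With substring matching, an identifier such as `profile` would also be rejected because it contains `file`; a word-boundary regex would be stricter about that, at the cost of a little more code.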
-
static check_byte_size(file, preferred_max_size=8000, preferred_min_size=21)
Finds the file size in bytes and tests it against the preferred maximum and minimum sizes. The default maximum is 8000 bytes.
Parameters: - file (str) – the file path to test against
- preferred_max_size (int) – the preferred maximum size to search for, defaults to 8000.
- preferred_min_size (int) – the preferred minimum size to search for, defaults to 21 (for int main(){return 0;})
Returns: boolean
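A size check of this kind can be written with os.path.getsize. The sketch below uses the defaults from the documented signature (8000 and 21); the function name size_in_range and the choice to treat both bounds as inclusive are assumptions for illustration.

```python
import os


def size_in_range(path, preferred_max_size=8000, preferred_min_size=21):
    """Return True if the file's size in bytes lies within the bounds."""
    size = os.path.getsize(path)
    # Assumed: both bounds are inclusive, so a 21-byte file passes.
    return preferred_min_size <= size <= preferred_max_size
```

The 21-byte minimum corresponds to the smallest program mentioned in the parameter description, `int main(){return 0;}`, which is exactly 21 characters long.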
-
static check_headers(line, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'))
Uses a regex to evaluate the line, ignoring case, and returns False if any unknown header is found.
Parameters: - line (str) – the string from a file.
- whitelisted (array of str) – the whitelisted headers, defaults to C_WHITELIST_HEADERS.
Returns: boolean
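One way to implement a whitelist check on include lines is sketched below. The regular expression, the function name header_is_allowed, and the decision to accept lines that are not include directives are all assumptions for this example; the source only says that a case-insensitive regex is used.

```python
import re

# Whitelist copied from the documented C_WHITELIST_HEADERS value.
C_WHITELIST_HEADERS = (
    "assert", "complex", "ctype", "errno", "fenv", "float", "inttypes",
    "limits", "locale", "math", "signal", "stddef", "stdint", "stdio",
    "stdlib", "stdnoreturn", "string", "tgmath", "time", "wchar", "wctype",
)

# Hypothetical pattern: captures the name inside <...> or "..." after #include.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.IGNORECASE)


def header_is_allowed(line, whitelisted=C_WHITELIST_HEADERS):
    """Return False only if the line includes a header not on the whitelist."""
    match = INCLUDE_RE.match(line)
    if match is None:
        return True  # not an include line, so nothing to reject
    name = match.group(1).lower()
    if name.endswith(".h"):
        name = name[:-2]  # compare "stdio.h" against the entry "stdio"
    return name in whitelisted
```

Because headers such as `pthread.h` are absent from the whitelist, a check like this would also reject threaded programs without needing a separate blacklist entry.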
-
static check_valid_data(file, preferred_max_size=8000, preferred_min_size=21, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'), blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))
Runs validation testing on a given file string. This checks for the correct byte size, the predetermined whitelisted headers, and the predetermined blacklisted words.
Parameters: - file (str) – the file the user wants to validate.
- preferred_max_size (int) – the max byte size the user wants
- preferred_min_size (int) – the min byte size the user wants
- whitelisted (tuple or array) – the whitelisted headers to search for.
- blacklisted (tuple or array) – the blacklisted words to exclude.
Returns: bool
-
static check_valid_folder(folder, filt_path_name='Unfiltered', preferred_max_size=8000, preferred_min_size=21, whitelisted=('assert', 'complex', 'ctype', 'errno', 'fenv', 'float', 'inttypes', 'limits', 'locale', 'math', 'signal', 'stddef', 'stdint', 'stdio', 'stdlib', 'stdnoreturn', 'string', 'tgmath', 'time', 'wchar', 'wctype'), blacklisted=('file', 'malloc', 'realloc', 'calloc', 'free'))
Runs check_valid_data for each file in the folder path.
Parameters: - folder (str) – the folder the user wants to validate for each C file.
- filt_path_name (str) – the filtered path word the user is using to store data once filtered.
- preferred_max_size (int) – the max byte size the user wants.
- preferred_min_size (int) – the min byte size the user wants.
- whitelisted (tuple or list) – the whitelisted headers the user wants.
- blacklisted (tuple or list) – the blacklisted words the user wants to exclude.
Returns: a list of filtered file paths
Return type: list
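The folder-level check amounts to walking a directory tree and applying a per-file predicate. The sketch below shows that shape with os.walk; the function name find_c_files and the is_valid callback are hypothetical stand-ins for whatever check_valid_folder does internally, and the handling of filt_path_name is deliberately left out because the source does not describe it.

```python
import os


def find_c_files(folder, is_valid=lambda path: True, file_type=".c"):
    """Return the paths under folder that end in file_type and pass is_valid."""
    kept = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            path = os.path.join(root, name)
            # is_valid is a placeholder for a per-file check such as
            # a size, header, and blacklist test.
            if name.endswith(file_type) and is_valid(path):
                kept.append(path)
    return kept
```

Passing the validation step in as a callable keeps the traversal separate from the filtering rules, which makes each part easy to test on its own.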
-
script_dir = '/Users/Josh/Documents/DecomPy/docs'
-
decompy.DataGathering.RepoFilter module
-
class decompy.DataGathering.RepoFilter.RepoFilter(search, language=None, blacklist=None, per_page=100, username=None, password=None)
Bases: object
First draft of the RepoFilter using the GitHub API. This class searches GitHub for repositories matching a search. It was written with offline queries in mind, as I did not have access to the internet most of the time while writing it. That may still be useful for backups and redundancy: it provides the ability to save a list so that another query does not have to be made to GitHub and so that results will not change as repositories change. Using too old a list might be bad, however, as the saved content will no longer match what is being filtered on.
-
filter_repo(repo)
Determines whether the given repo matches our desired criteria, using GitHub information beyond what a simple search can filter on.
Parameters: - repo (json) – The JSON from the GitHub repo to filter
-
get_filtered_list(unfiltered_repos)
Filters a list of repositories.
Parameters: - unfiltered_repos (list) – The list of unfiltered repositories
-
get_results(date, page)
Makes a single request to the GitHub API for a page of results matching the search criteria.
Parameters: - date (str) – the date to search for; searches are split by date because the GitHub API returns at most 1000 results per search.
- page (int) – Which page of results should be fetched.
Returns: void
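For orientation, the sketch below builds the kind of request URL a single page fetch could use against the GitHub repository search endpoint. The endpoint and the q, per_page, and page parameters are part of the public GitHub REST API; the function name build_search_url and the choice of the `created:` qualifier to narrow by date are assumptions, since the source does not show how RepoFilter composes its query.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(search, date, page, language="C", per_page=100):
    """Return the URL for one page of repository search results."""
    # Assumed query shape: free text plus language and creation-date qualifiers.
    query = f"{search.strip()} language:{language} created:{date}"
    params = urlencode({"q": query, "per_page": per_page, "page": page})
    return f"{SEARCH_ENDPOINT}?{params}"
```

Narrowing each query to a single date keeps every search under the 1000-result cap, so paging through `page=1..N` for each date can cover far more repositories than one broad query.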
-
offline_filtered_list(filename, unfiltered_repos)
Filters a list of repositories and saves it to JSON for persistent usage.
Parameters: - filename (str) – The name to save the file as
- unfiltered_repos (list) – The list of repos to filter
-
static offline_read_json(filename)
Reads in a JSON file.
Parameters: - filename (str) – The filename to read
-
offline_results(filename, date, start_page, end_page)
Saves the list of all repos for offline usage.
Parameters: - filename (str) – The name that will be given to the generated file
- date (str) – the date to search for
- start_page (int) – The index of the first page that should be saved
- end_page (int) – The index of the last page that should be saved
-
decompy.DataGathering.RepoStructure module
-
class decompy.DataGathering.RepoStructure.RepoStructure(repo_path='Repositories', parent_dir='.')
Bases: object
Module contents
-
class decompy.DataGathering.FileGetter
Bases: object
Handles downloading GitHub repositories and extracting the useful files.
-
static download_all_files(repo_urls, target_directories=None, username=None, password=None)
Handles the downloading of ZIP archives and extracting the appropriate files into the target directory.
Parameters: - repo_urls (str) – the list of URLs to repositories. URLs must be to the top level of the repositories
- target_directories (str) – File directory to store the files
- username (str) – the GitHub username.
- password (str) – the GitHub password.
Returns: Nothing
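The extraction half of this step can be illustrated without any network access. The sketch below takes the bytes of a ZIP archive and writes only the `.c` files into a target directory. The function name extract_c_files, the `.c`-only rule, and the flattening of archive paths to bare file names are assumptions for illustration; the source does not specify which files count as "appropriate" or how they are laid out.

```python
import io
import zipfile
from pathlib import Path


def extract_c_files(zip_bytes, target_directory):
    """Write every .c file in the archive into target_directory."""
    target = Path(target_directory)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".c"):
                continue  # skip directories and non-C files
            # Assumed layout: flatten to the bare file name, so two files
            # with the same name in different folders would overwrite.
            destination = target / Path(info.filename).name
            destination.write_bytes(archive.read(info))
            written.append(str(destination))
    return written
```

Reading members with `archive.read` and writing them under a name of our choosing, rather than calling `extractall`, also avoids trusting the paths stored inside a downloaded archive.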
-
class decompy.DataGathering.RepoFilter(search, language=None, blacklist=None, per_page=100, username=None, password=None)
Bases: object
Documented identically to decompy.DataGathering.RepoFilter.RepoFilter; see the RepoFilter module section above for the class description and its methods (filter_repo, get_filtered_list, get_results, offline_filtered_list, offline_read_json, offline_results).
-
class decompy.DataGathering.RepoStructure(repo_path='Repositories', parent_dir='.')
Bases: object
-
class decompy.DataGathering.Clang
Bases: object
Documented identically to decompy.DataGathering.ClangSubprocess.Clang; see the ClangSubprocess module section above for the class description and its static methods (compile_all, compile_cfile, to_assembly, to_elf, to_llvm_opt, to_llvm_unopt, to_object_file).
-
class decompy.DataGathering.FilterC
Bases: object
Documented identically to decompy.DataGathering.FilterC.FilterC; see the FilterC module section above for the class description, its attributes (C_BLACKLIST, C_WHITELIST_HEADERS, FILE_TYPE, MAX_BYTES, MIN_BYTES, script_dir), and its static methods (check_blacklisted_words, check_byte_size, check_headers, check_valid_data, check_valid_folder).