Notes: De-anonymizing Programmers via Code Stylometry
Essay Information
De-anonymizing Programmers via Code Stylometry
Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss,
Fabian Yamaguchi, and Rachel Greenstadt.
USENIX Security Symposium, 2015
Source code stylometry
Everyone learns to code individually and, as a result, codes in a
unique style, which makes de-anonymization possible.
Software engineering insights
- programmer style changes while implementing sophisticated functionality
- differences in coding style of programmers with different skill sets
Identify malicious programmers.
Scenario 1 : Who wrote this code?
Alice analyzes a library with malicious source code.
Bob has a source code collection with known authors.
Bob will search his collection to find Alice’s adversary.
Scenario 2 : Who wrote this code?
Alice got an extension for her programming assignment.
Bob, the teacher, has everyone else’s code.
Bob wants to see if Alice plagiarized.
Comparison to related work
Machine learning workflow
Abstract Syntax Trees (AST)
- Stylometry can be applied to source code to identify the author of a program.
- Extract layout and lexical features from source code.
- Abstract syntax trees (AST) in code represent the structure of the program.
- Preprocess source code to obtain AST.
- Parse AST to extract coding style features.
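The parse-then-extract steps above can be sketched as follows. This is only an illustration: it uses Python's built-in `ast` module on Python source as a stand-in for whatever parser is used on the original code. The function name `ast_style_features` and the two feature kinds shown (node-type frequency and maximum tree depth) are assumptions chosen for the example, not the full syntactic feature set.

```python
import ast
from collections import Counter


def ast_style_features(source: str) -> dict:
    """Return AST node-type frequencies and the maximum depth of the tree.

    Hypothetical helper for illustration; the feature names are made up here.
    """
    tree = ast.parse(source)  # preprocess: source code -> AST

    # Count how often each kind of AST node appears.
    node_counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(node_counts.values())

    def depth(node: ast.AST) -> int:
        # Depth of a node = 1 + depth of its deepest child (a leaf has depth 1).
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    features = {f"freq_{name}": count / total for name, count in node_counts.items()}
    features["max_depth"] = depth(tree)
    return features


sample = "def add(a, b):\n    return a + b\n"
features = ast_style_features(sample)
```

Two programs that behave identically can still differ in these numbers, for example when one author nests expressions deeply and another splits them into temporaries. That difference is the kind of signal a syntactic feature is meant to capture.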
Feature Extraction
### Code Stylometry Feature Set (CSFS)
Lexical features (extracted from source code)
Layout features (extracted from source code)
Syntactic features (extracted from ASTs)
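As a rough illustration of the first two categories, the sketch below computes a few lexical and layout measurements directly from the source text. The specific features, their names, the `KEYWORDS` subset, and the helper `lexical_layout_features` are assumptions made for this example; they are not the complete CSFS definition.

```python
import re
from collections import Counter

KEYWORDS = ("if", "else", "for", "while", "do", "switch")  # illustrative subset


def lexical_layout_features(source: str) -> dict:
    """Compute a handful of example lexical and layout features (hypothetical names)."""
    lines = source.splitlines()
    n_chars = max(len(source), 1)
    n_lines = max(len(lines), 1)
    words = Counter(re.findall(r"[A-Za-z_]\w*", source))

    # Lines that start with whitespace, and how many of those start with a tab.
    indented = [ln for ln in lines if ln[:1] in (" ", "\t")]
    tab_led = sum(ln.startswith("\t") for ln in indented)

    features = {
        # Layout: how the author uses whitespace.
        "tab_ratio": source.count("\t") / n_chars,
        "space_ratio": source.count(" ") / n_chars,
        "empty_line_ratio": sum(not ln.strip() for ln in lines) / n_lines,
        "tabs_lead_lines": tab_led * 2 > len(indented),
        # Lexical: measurements over the text and its tokens.
        "avg_line_length": sum(len(ln) for ln in lines) / n_lines,
    }
    for kw in KEYWORDS:
        features[f"kw_{kw}"] = words[kw] / n_chars  # keyword frequency per character
    return features


sample = (
    "int main() {\n"
    "\tfor (int i = 0; i < 3; i++) {\n"
    "\t\tif (i) puts(\"x\");\n"
    "\t}\n"
    "\n"
    "\treturn 0;\n"
    "}\n"
)
features = lexical_layout_features(sample)
```

Note that the word-level regex also matches keywords inside comments and string literals. A real extractor would tokenize the language properly.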
Feature Selection
Feature selection uses WEKA’s information gain criterion, which evaluates the difference between the entropy of the distribution of classes and the entropy of the conditional distribution of classes given a particular feature:
$$IG(A, M_i) = H(A) - H(A \mid M_i)$$
where $A$ is the class corresponding to an author, $H$ is Shannon entropy, and $M_i$ is the $i$-th feature of the dataset.
Intuitively, the information gain can be thought of as measuring the amount of information that the observation of the value of feature i gives about the class label associated with the example.
To reduce the total size and sparsity of the feature vector, we retained only those features that individually had non-zero information gain.
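The criterion described above, entropy of the author labels minus the conditional entropy given one feature, can be worked through on a toy example. This is a minimal pure-Python sketch, not WEKA's implementation. It assumes each feature takes discrete values, and the data and the names `uses_tabs` and `has_main` are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, authors) -> float:
    """H(A) - H(A | M_i) for one discrete feature M_i and author labels A."""
    # Group the author labels by the value the feature takes.
    by_value = defaultdict(list)
    for value, author in zip(feature_values, authors):
        by_value[value].append(author)

    total = len(authors)
    # Conditional entropy: weighted average of the entropy within each group.
    conditional = sum(len(group) / total * entropy(group) for group in by_value.values())
    return entropy(authors) - conditional


authors = ["alice", "alice", "bob", "bob"]
uses_tabs = [1, 1, 0, 0]  # separates the two authors perfectly
has_main = [1, 1, 1, 1]   # identical for every sample, so it tells us nothing

features = {"uses_tabs": uses_tabs, "has_main": has_main}

# Keep only features with non-zero information gain.
kept = [name for name, values in features.items() if information_gain(values, authors) > 0]
```

Here `uses_tabs` yields a gain of one bit (it fully determines the author), while `has_main` yields zero and is dropped.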
Random Forest Classification
Method
- Use random forest as the machine learning classifier
  - avoid over-fitting
  - multi-class classifier by nature
- K-fold cross validation
- Validate method on a different dataset
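The k-fold cross validation step can be sketched without any particular classifier. The hypothetical helper below, `stratified_k_folds`, only builds the splits, spreading each author's samples across the folds. It assumes each author has at least `k` samples, so that every test fold contains every author. Training and scoring a random forest on each split (for example with an off-the-shelf implementation) is left out.

```python
from collections import defaultdict


def stratified_k_folds(authors, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Each author's samples are dealt round-robin into the k folds.
    """
    by_author = defaultdict(list)
    for index, author in enumerate(authors):
        by_author[author].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_author.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield train, test


# Toy dataset: three files per author, labels only.
authors = ["alice", "bob", "carol"] * 3
splits = list(stratified_k_folds(authors, k=3))
```

Each sample is used for testing exactly once, and accuracy is typically reported as the average over the `k` held-out folds.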
Future work
- Multiple authorship detection
- Multiple author identification
- Anonymizing source code
  - obfuscation is not the answer
- Stylometry in executable binaries
  - authorship attribution