The code in this directory makes up the "git data miner," a simple hack
which attempts to figure things out from the revision history in a git
repository.

gitdm is a Python script and does not need to be installed the way most
programs do. Either adjust your PATH variable to point at the gitdm
directory or, alternatively, create a symbolic link to the script.

Before actually running gitdm, you may also want to update the configuration
file (gitdm.config) with the needed information.

Run gitdm by piping the output of git log into it:

   git log -p -M [details] | gitdm [options]

Alternatively, you can run with:

   git log --numstat -M [details] | gitdm -n [options]

The [details] tell git which changesets are of interest; the [options] can
be:

   -a          If a patch contains signoff lines from both Andrew Morton
               and Linus Torvalds, omit Linus's.

   -b dir      Specify the base directory from which the configuration
               files are fetched.

   -c file     Specify the name of the gitdm configuration file.
               By default, "./gitdm.config" is used.

   -d          Omit the developer reports, giving employer information
               only.

   -D          Rather than create the usual statistics, create a file
               (datelc.csv) providing lines changed per day. The first
               column gives the lines changed on that day alone; the
               second gives the running total up to and including that
               day. This option is suitable for feeding to a tool like
               gnuplot.

   -h file     Generate HTML output to the given file.

   -H file     Export individual developer raw data as CSV. These data
               could be used to evaluate the fidelity of developers.

   -l num      Only list the top <num> entries in each report.

   -n          Use --numstat instead of generated patches to get the
               statistics.

   -o file     Write text output to the given file (default is stdout).

   -p prefix   Dump out the database categorized by changeset and by
               file type. It requires -n; otherwise it is not possible
               to get separated results.

   -r pat      Only generate statistics for changes to files whose
               name matches the given regular expression.

   -s          Ignore Signed-off-by lines which match the author of
               each patch.

   -t          Generate a report by type of contribution (code,
               documentation, etc.). It requires -n; otherwise this
               option is silently ignored.

   -u          Group all unknown developers under the "(Unknown)"
               employer.

   -x file     Export raw statistics as CSV.

   -w          Aggregate the data by weeks instead of months in the
               CSV file when -x is used.

   -z          Dump out the hacker database to "database.dump".
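
To make the -D description concrete, here is a small Python sketch of how
a per-day count and a running total can be computed. It is an illustration
only, not gitdm's own code: the function name daily_and_cumulative and the
input format are made up for this example, and the date is carried along
purely as a label for each row.

```python
from collections import defaultdict
from datetime import date


def daily_and_cumulative(changes):
    """changes: iterable of (date, lines_changed) pairs, one per patch.

    Returns a list of (date, lines_that_day, running_total) rows in
    chronological order. Hypothetical helper, not part of gitdm.
    """
    per_day = defaultdict(int)
    for day, lines in changes:
        per_day[day] += lines

    rows = []
    total = 0
    for day in sorted(per_day):
        total += per_day[day]
        rows.append((day, per_day[day], total))
    return rows


sample = [
    (date(2007, 1, 2), 10),
    (date(2007, 1, 1), 5),
    (date(2007, 1, 2), 3),
]
for day, changed, total in daily_and_cumulative(sample):
    print(f"{day.isoformat()},{changed},{total}")
# prints:
# 2007-01-01,5,5
# 2007-01-02,13,18
```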

A typical command line used to generate the "who wrote 2.6.x" LWN articles
looked like:

   git log -p -M v2.6.19..v2.6.20 | \
      gitdm -u -s -a -o results -h results.html

or:

   git log --numstat -M v2.6.19..v2.6.20 | \
      gitdm -u -s -a -n -o results -h results.html
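
If you would rather drive the same pipeline from a script, something like
the following Python sketch should do. The helper names build_pipeline and
run_pipeline are invented for this example; the sketch assumes that both
git and gitdm can be found on your PATH and simply reuses the options shown
above.

```python
import subprocess


def build_pipeline(rev_range, numstat=False):
    """Return the (git, gitdm) argument lists for one run.

    Hypothetical helper; the flags are the ones from the examples above.
    """
    git_cmd = ["git", "log", "--numstat" if numstat else "-p", "-M", rev_range]
    gitdm_cmd = ["gitdm", "-u", "-s", "-a"]
    if numstat:
        gitdm_cmd.append("-n")  # tell gitdm to expect --numstat input
    gitdm_cmd += ["-o", "results", "-h", "results.html"]
    return git_cmd, gitdm_cmd


def run_pipeline(rev_range, numstat=False, repo="."):
    """Pipe 'git log' into gitdm, as the shell '|' would."""
    git_cmd, gitdm_cmd = build_pipeline(rev_range, numstat)
    git = subprocess.Popen(git_cmd, cwd=repo, stdout=subprocess.PIPE)
    try:
        subprocess.run(gitdm_cmd, stdin=git.stdout, cwd=repo, check=True)
    finally:
        git.stdout.close()
        git.wait()
```

Calling run_pipeline("v2.6.19..v2.6.20", numstat=True) from inside a kernel
tree would then correspond to the second command line above.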

The main purpose of the configuration file is to direct the mapping of
email addresses onto employers. Please note that the config file parser is
exceptionally stupid and unrobust at this point, but it gets the job done.

Blank lines and lines beginning with "#" are ignored. Everything else
specifies a file with some sort of mapping:

Developers often post code under a number of different email
addresses, but it can be desirable to group them all together in
the statistics. An EmailAliases file just contains a bunch of
lines of the form:

   alias@address canonical@address

Any patches originating from alias@address will be treated as if
they had come from canonical@address.

It may happen that some people set their git user data in the
following form: "joe.hacker@acme.org <Joe Hacker>". "Joe Hacker"
is then taken to be the email address, and gitdm reports it as a
"Funky" email. A line of the following form can be used to alias
these commits to the correct email address:

   "Joe Hacker" joe.hacker@acme.org
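
For illustration, the two alias forms described above can be read with a
few lines of Python. This is a sketch, not gitdm's actual parser: the
function names are invented here, and it assumes that blank lines and "#"
comments are skipped in alias files the same way they are in the main
configuration file.

```python
import shlex


def parse_aliases(lines):
    """Build an alias -> canonical address table (hypothetical helper)."""
    aliases = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # shlex keeps a quoted name such as "Joe Hacker" together as one field.
        fields = shlex.split(line)
        if len(fields) != 2:
            continue  # skip anything that is not "alias canonical"
        alias, canonical = fields
        aliases[alias] = canonical
    return aliases


def canonical_address(aliases, address):
    """Return the canonical address, or the address itself if not aliased."""
    return aliases.get(address, address)


table = parse_aliases([
    "# sample alias file",
    "alias@address canonical@address",
    '"Joe Hacker" joe.hacker@acme.org',
])
print(canonical_address(table, "Joe Hacker"))   # prints joe.hacker@acme.org
print(canonical_address(table, "someone@else")) # prints someone@else
```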

An EmailMap file maps email addresses onto employers. These files contain
lines of the form:

   [user@]domain employer [< yyyy-mm-dd]

If the "user@" portion is missing, all email from the given domain
will be treated as being associated with the given employer. If a
date is provided, the entry is only valid up to that date;
otherwise it is considered valid into the indefinite future. This
feature can be useful for properly tracking developers' work when
they change employers but do not change email addresses.
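
The date handling is easier to see in code. The Python sketch below parses
lines of the form shown above and looks up the employer for an address on
a given day. It is not gitdm's implementation, and it makes several
assumptions that you should check against the real tool: the end date is
treated as exclusive, a full-address entry wins over a domain-wide one,
domains are compared exactly (no subdomain matching), and the names
parse_email_map and employer_for are invented for the example.

```python
from datetime import date


def parse_email_map(lines):
    """Return a list of (pattern, employer, end_date_or_None) entries."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        end = None
        if "<" in line:
            line, _, stamp = line.partition("<")
            end = date.fromisoformat(stamp.strip())
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # need both an address/domain and an employer
        entries.append((parts[0].lower(), parts[1].strip(), end))
    return entries


def employer_for(entries, address, when):
    """Find the employer for address on the day 'when', or None."""
    address = address.lower()
    domain = address.partition("@")[2]
    for wanted in (address, domain):  # assumption: exact address first
        valid = [
            (end if end is not None else date.max, employer)
            for pattern, employer, end in entries
            if pattern == wanted and (end is None or when < end)
        ]
        if valid:
            # Prefer the entry that expires soonest among those still valid.
            return min(valid)[1]
    return None


entries = parse_email_map([
    "example.com Example Corp",
    "joe@example.org OldCorp < 2007-01-01",
    "joe@example.org NewCorp",
])
print(employer_for(entries, "joe@example.org", date(2006, 6, 1)))  # OldCorp
print(employer_for(entries, "joe@example.org", date(2008, 1, 1)))  # NewCorp
```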

GroupMap file employer

This is a variant of EmailMap provided for convenience; it contains
email addresses only, all of which are associated with the given
employer.

This construct (which appears in the main configuration file)
causes the creation of a fake employer with the given
"name". It directs that any contributions attributed to that
employer should be split among other (real) employers using the given
percentages. The functionality works, but is primitive: there is,
for example, no check to ensure that the percentages add up to
100%.
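
As a rough illustration of the splitting described above, the following
Python sketch divides a line count among real employers by percentage and,
unlike gitdm as described here, refuses to proceed when the percentages do
not add up to 100. The function name split_virtual and the dictionary
layout are assumptions made for the example; they say nothing about how
gitdm stores this data internally.

```python
def split_virtual(changed_lines, shares):
    """Split a fake employer's line count among real employers.

    shares maps a real employer name to its percentage (hypothetical
    layout). Raises ValueError if the percentages do not total 100.
    """
    total = sum(shares.values())
    if total != 100:
        raise ValueError(f"percentages add up to {total}, not 100")
    return {name: changed_lines * pct / 100 for name, pct in shares.items()}


print(split_virtual(200, {"Employer A": 75, "Employer B": 25}))
# prints {'Employer A': 150.0, 'Employer B': 50.0}
```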

Map file names/extensions onto file types. These files contain lines
of the form:

   order <type1>,<type2>,...,<typeN>

   filetype <type> <regex>

This construct allows fine-grained reports by type of contribution
(build, code, image, multimedia, documentation, etc.).

Order is important because file name patterns can overlap. For
instance, ltmain.sh fits better as 'build' than as 'code' (matching
the filename itself rather than '\.sh$'). The first element in order
has precedence over the next ones.
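
The precedence rule can be sketched in Python as follows. This is an
illustration under stated assumptions, not gitdm's code: the helper names
and the "unknown" fallback are invented, and the sketch assumes that a
pattern may match anywhere in the file name (re.search) rather than only
at its start.

```python
import re


def parse_file_type_map(lines):
    """Return (order, patterns) from 'order' and 'filetype' lines."""
    order = []
    patterns = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keyword, _, rest = line.partition(" ")
        if keyword == "order":
            order = [t.strip() for t in rest.split(",") if t.strip()]
        elif keyword == "filetype":
            parts = rest.split(None, 1)
            if len(parts) == 2:
                patterns.setdefault(parts[0], []).append(re.compile(parts[1]))
    return order, patterns


def classify(filename, order, patterns, default="unknown"):
    """Return the first type in 'order' with a pattern matching filename."""
    for ftype in order:
        for regex in patterns.get(ftype, []):
            if regex.search(filename):
                return ftype
    return default


order, patterns = parse_file_type_map([
    "order build,code",
    r"filetype code \.sh$",
    r"filetype build ltmain\.sh$",
])
print(classify("ltmain.sh", order, patterns))       # prints build
print(classify("scripts/run.sh", order, patterns))  # prints code
```

Because 'build' comes first in the order line, ltmain.sh is classified as
'build' even though it also matches the '\.sh$' pattern for 'code'.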

A few other tools have been added to this repository:

Reads a set of commits, then generates a graphviz file charting the
flow of patches into the mainline. Needs to be smarter, but, then,
so does everything else in this directory.

Simple brute-force crawler which outputs the names of any files
which have not been touched since the original (kernel) commit.

I needed to be able to quickly associate a given commit with the
major release which contains it. First attempt used
"git tag --contains"; after it ran for a solid week, I concluded
there must be a better way. This tool just reads through the repo,
remembering tags, and creating a Python dictionary containing the
association. The result is an ugly 10MB pickle file, but, even so,
it's still a better way.

Crawls through a directory hierarchy, counting how many lines of
code are associated with each major release. Needs the pickle file
from committags to get the job done.

Gitdm was written by Jonathan Corbet; many useful contributions have come
from Greg Kroah-Hartman.

Please note that this tool is provided in the hope that it will be useful,
but it is not put forward as an example of excellence in design or
implementation. Hacking on gitdm tends to stop the moment it performs
whatever task is required of it at the moment. Patches to make it less
hacky, less ugly, and more robust are welcome.