convore.json/groups/python/source-code-detection-library/messages.json
[{"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299464045.5110569, "message": "I found that pygments has functionality that seems to do this, but in reality it seems it's just looking for a shebang like /usr/bin/env python", "group_id": 292, "id": 283466}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299464111.230859, "message": "So...does anyone know of a library that does this?", "group_id": 292, "id": 283469}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299463997.504684, "message": "Like, this text is 78% likely to be Python", "group_id": 292, "id": 283458}, {"user_id": 7, "stars": [], "topic_id": 11200, "date_created": 1299464043.376425, "message": "@ericflo I don't know a library that does quite that, pygments might have some, or you could use a bayesian classifier and train it with something", "group_id": 292, "id": 283465}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299463981.2979381, "message": "I'm looking for a library that, given some text, determines what kind of source code it's looking at, and gives a confidence interval", "group_id": 292, "id": 283457}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299464004.3460121, "message": "or 32% likely to be Java", "group_id": 292, "id": 283459}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299464100.823452, "message": "@alex Yeah, bayesian classifier might be what I end up needing to go with, but a library which looks for certain heuristics would be fine if it exists", "group_id": 292, "id": 283468}, {"user_id": 207, "stars": [], "topic_id": 11200, "date_created": 1299470613.6185651, "message": "but instead of combinations of letters, you look for combinations of punctuation maybe", "group_id": 292, "id": 284666}, {"user_id": 207, "stars": [], "topic_id": 11200, "date_created": 1299470641.2567379, "message": "but no, i don't know of any libraries to do this already", "group_id": 292, "id": 
284820}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299471767.5391769, "message": "http://code.google.com/p/google-code-prettify/source/browse/trunk/src/prettify.js#1056", "group_id": 292, "id": 285056}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299471776.9131899, "message": "'keywords': ALL_KEYWORDS,", "group_id": 292, "id": 285058}, {"user_id": 207, "stars": [{"date_created": 1299470654.9266491, "user_id": 1}], "topic_id": 11200, "date_created": 1299470630.374378, "message": "but it'd probably be a combination of that and keywords i suppose", "group_id": 292, "id": 284817}, {"user_id": 207, "stars": [], "topic_id": 11200, "date_created": 1299470592.120827, "message": "i wonder if that would work something like n-gram analysis", "group_id": 292, "id": 284664}, {"user_id": 3456, "stars": [{"date_created": 1299470786.188216, "user_id": 1}, {"date_created": 1299508612.8732769, "user_id": 214}], "topic_id": 11200, "date_created": 1299470688.1637299, "message": "a loose bayesian filter would probably catch whatever a simple unique-keyword detection routine misses", "group_id": 292, "id": 284822}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299470883.848541, "message": "@igorgue Yeah, I came across that, looks really good actually", "group_id": 292, "id": 284852}, {"user_id": 11972, "stars": [{"date_created": 1299470869.5967219, "user_id": 1}, {"date_created": 1299749816.8429141, "user_id": 14438}], "topic_id": 11200, "date_created": 1299470862.3405299, "message": "@ericflo If you're going to implement it, it might be worth to take a look at http://code.google.com/p/google-code-prettify/", "group_id": 292, "id": 284848}, {"user_id": 207, "stars": [], "topic_id": 11200, "date_created": 1299470936.7816811, "message": "guh, i just checked the ohloh source code, and they do have language detection, but it's in C and a complete mess", "group_id": 292, "id": 284857}, {"user_id": 1736, "stars": [], "topic_id": 
11200, "date_created": 1299472625.2384911, "message": "GitHub tracks stats on this kind of thing, but I think I remember they just check the file suffix.", "group_id": 292, "id": 285236}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299471764.7747769, "message": "Haha, I just had a read through the google-code-prettify code.  I thought it was doing detection, but it turns out it's just applying *super generic* rules if it can't detect the language.", "group_id": 292, "id": 285055}, {"user_id": 11972, "stars": [], "topic_id": 11200, "date_created": 1299479261.4997771, "message": "@ericflo lol, that's funny, who knows then, I wonder if there is a good solution for this. Meaning pretty much you have to analyze the data, you know, write a parser lex yacc...", "group_id": 292, "id": 285908}, {"user_id": 11592, "stars": [], "topic_id": 11200, "date_created": 1299484612.626565, "message": "check source for code highlighters and just run all of them against the code reporting bad keywords and/or tokens", "group_id": 292, "id": 286129}, {"user_id": 1243, "stars": [], "topic_id": 11200, "date_created": 1299498113.3458371, "message": "@ericflo [highlight.js](http://softwaremaniacs.org/soft/highlight/en/ ) has some sort of detection, though I'm not sure how it works.", "group_id": 292, "id": 287006}, {"user_id": 1509, "stars": [], "topic_id": 11200, "date_created": 1299511840.22616, "message": "@ericflo pygments has a function ` analyse_text(text)` to analyze code. Maybe it is what you're looking for http://pygments.org/docs/api/#lexers", "group_id": 292, "id": 288261}, {"user_id": 548, "stars": [{"date_created": 1299614964.3810091, "user_id": 13501}, {"date_created": 1299741965.9437051, "user_id": 11592}, {"date_created": 1299763146.91976, "user_id": 16058}, {"date_created": 1299883143.795769, "user_id": 1081}], "topic_id": 11200, "date_created": 1299604503.2474861, "message": "Do what Google did with google voice... 
make a convore paste site for code with language pull down and use the data to train your classifier.  :)", "group_id": 292, "id": 297575}, {"user_id": 13501, "stars": [], "topic_id": 11200, "date_created": 1299614774.781013, "message": "so with the new feature, what did you do, @ericflo?", "group_id": 292, "id": 299032}, {"user_id": 13501, "stars": [], "topic_id": 11200, "date_created": 1299613940.2509871, "message": "@ericflo 78 + 32 ... don't ignore those lisp users", "group_id": 292, "id": 298909}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299615416.2041171, "message": "@itsnotvalid Just didn't do syntax highlighting", "group_id": 292, "id": 299106}, {"user_id": 13501, "stars": [], "topic_id": 11200, "date_created": 1299615464.4358051, "message": "@ericflo sorry for my poor level of English, I meant which solution did you go after?", "group_id": 292, "id": 299112}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299615500.4860289, "message": "@itsnotvalid None at all, we're not detecting source code, we're saving that for later", "group_id": 292, "id": 299117}, {"user_id": 4219, "stars": [{"date_created": 1299696978.3384299, "user_id": 1}, {"date_created": 1299704659.8415461, "user_id": 3580}, {"date_created": 1299837908.6189809, "user_id": 4858}, {"date_created": 1299941094.2890129, "user_id": 6618}, {"date_created": 1311209704.6573811, "user_id": 32520}], "topic_id": 11200, "date_created": 1299682874.4385569, "message": "i need a library that detects bad code and then auto-corrects.", "group_id": 292, "id": 305400}, {"user_id": 19652, "stars": [], "topic_id": 11200, "date_created": 1299698372.6576631, "message": "Pygments stuff goes a bit beyond just checking the code but it ain't a big improvement. 
Maybe try to talk to the guys for ohloh.net they have the code for it and I know they use python.", "group_id": 292, "id": 307602}, {"user_id": 13912, "stars": [], "topic_id": 11200, "date_created": 1299706163.5036631, "message": "but I think that what you're looking for describes what a bayesian classifier does, so you'd basically need to tweak one to tokenize (extract the features) from the text not as words but something that'd make sense for PL detection", "group_id": 292, "id": 308554}, {"user_id": 13912, "stars": [], "topic_id": 11200, "date_created": 1299706227.395715, "message": "it's an interesting problem I may want to tackle when I have some free time", "group_id": 292, "id": 308564}, {"user_id": 13912, "stars": [], "topic_id": 11200, "date_created": 1299706081.8927929, "message": "general purpose bayesian classifiers built for text are not something I'd expect to do this job well", "group_id": 292, "id": 308538}, {"user_id": 11039, "stars": [], "topic_id": 11200, "date_created": 1299722568.8457301, "message": "Though programming languages and natural language are, the bulk of the time, extremely different from a lexical point of view. I might try, as a first step, deriving an N-grammatic break-down of entered text. That is, pick a value for N (somewhere between 2 and 6) and break the text down into chunks of N characters in a sliding-window way. Then train the classifier on the eigenvectors of Ngram frequencies.\n\nSo, for a 3gram of \"this is a string,\" you'd start recursing down and spit out [\"thi\", \"his\", \"is \", \"s a\", ...], count occurrences or whatnot, or alternately build a dictionary with the frequency of the 3grams (ie. {\"thi\":1, \"his\":1, ...}) and toss that down into whatever system you want. The longer the Ngram, the fewer unique Ngrams you'll see / the higher frequency counts'll be. 
Picking the breakpoint on those can be fairly trial-and-error, but it works.", "group_id": 292, "id": 309964}, {"user_id": 207, "stars": [], "topic_id": 11200, "date_created": 1299726642.275774, "message": "i'd just worry that variables names would throw it off too much, since they're not unique to any language", "group_id": 292, "id": 310361}, {"user_id": 207, "stars": [], "topic_id": 11200, "date_created": 1299726673.0636261, "message": "some common conventions (like \"self\", in python) might show up and help out a bit, but others (like \"this\") wouldn't", "group_id": 292, "id": 310365}, {"user_id": 207, "stars": [], "topic_id": 11200, "date_created": 1299726620.1734221, "message": "i had wondered about that approach myself", "group_id": 292, "id": 310358}, {"user_id": 4383, "stars": [], "topic_id": 11200, "date_created": 1299740286.2046859, "message": "@ericflo If I were you I wouldn't do ngrams.", "group_id": 292, "id": 311494}, {"user_id": 4383, "stars": [], "topic_id": 11200, "date_created": 1299740273.7802401, "message": "@ericflo Applying *super generic* rules if you can't figure out the language would be enough I think.", "group_id": 292, "id": 311491}, {"user_id": 11039, "stars": [], "topic_id": 11200, "date_created": 1299739902.7644501, "message": "Which is why in your training pool, you include samples of multiple languages you want to be detected and multiple corpus of non-code to train on for the non-code detection. Things like indentation, etc, generally set English fairly apart at the 3gram and 4gram level. \n\n\"Are you seeing lots of multiple parens and spaces sequentially? 
Then you're probably looking at code,\" may be what the Bayesian classifier comes away with, but -- by and large -- it's true.", "group_id": 292, "id": 311482}, {"user_id": 4383, "stars": [], "topic_id": 11200, "date_created": 1299740293.735697, "message": "for source code at least", "group_id": 292, "id": 311495}, {"user_id": 4383, "stars": [], "topic_id": 11200, "date_created": 1299740384.0613899, "message": "I would use a kind of tokenizer that works similarly to the one a compiler would use.", "group_id": 292, "id": 311502}, {"user_id": 4383, "stars": [], "topic_id": 11200, "date_created": 1299740451.935818, "message": "getting something like \"$ab\" \"ab\" \"ba\" \"ra\" \"ac\" ... would be almost useless to a bayesian classifier for this kind of thing.", "group_id": 292, "id": 311504}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299740458.854893, "message": "Whenever I get around to building this the first approach I'll take is probably going to be based on keywords.", "group_id": 292, "id": 311505}, {"user_id": 4383, "stars": [], "topic_id": 11200, "date_created": 1299740482.832423, "message": "you want \"$\" \"abracadabra\" and ignore the one with no symbols.", "group_id": 292, "id": 311506}, {"user_id": 4383, "stars": [], "topic_id": 11200, "date_created": 1299740484.1809299, "message": "Yah,", "group_id": 292, "id": 311507}, {"user_id": 4383, "stars": [{"date_created": 1299741118.1218319, "user_id": 1}], "topic_id": 11200, "date_created": 1299740719.9768419, "message": "I would also try not to get carried away with it. Just implement it until it's good enough. like say 70~80% accurate and give it a high threshold (like the top choice needs to be 2x the 2nd choice) and fall back to generic or no highlighting otherwise.", "group_id": 292, "id": 311521}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299741136.6229229, "message": "Oh yeah, totally. 
Actually I'm not even working on it right now.", "group_id": 292, "id": 311546}, {"user_id": 4383, "stars": [], "topic_id": 11200, "date_created": 1299740770.9677899, "message": "There is a never ending list of things you could do to try and make it more accurate and you could spend forever on it if you don't stop yourself. ;p", "group_id": 292, "id": 311523}, {"user_id": 4383, "stars": [], "topic_id": 11200, "date_created": 1299741457.421257, "message": "Lucky, I wish I was going :(", "group_id": 292, "id": 311575}, {"user_id": 4383, "stars": [], "topic_id": 11200, "date_created": 1299741176.6831441, "message": "Well, yah, you are at the conference right? You shouldn't be working too hard anyway ;p", "group_id": 292, "id": 311551}, {"user_id": 1, "stars": [], "topic_id": 11200, "date_created": 1299741250.3014369, "message": "@ianmlewis Not quite yet, I fly out tomorrow.", "group_id": 292, "id": 311556}, {"user_id": 13501, "stars": [], "topic_id": 11200, "date_created": 1299808064.9569321, "message": "@SquidLord that is similar to highlight.js where they use a heuristic of counting which language scores the most", "group_id": 292, "id": 318599}, {"user_id": 8558, "stars": [], "topic_id": 11200, "date_created": 1299879519.7725241, "message": "If you have samples of known source codes in languages, say a..k, and an unknown source code x, then concatenate x to each of a..k, yielding ax..kx, and gzip each file. If, say, fx has the best compression, then x is likely to be of same language as f.", "group_id": 292, "id": 327493}, {"user_id": 11039, "stars": [], "topic_id": 11200, "date_created": 1299883198.5197871, "message": "@kseistrup That's brilliant! 
Horribly roundabout, but brilliant; coding minimal if the source files are of moderate size and you have the gzip libraries handy.", "group_id": 292, "id": 328277}, {"user_id": 20745, "stars": [], "topic_id": 11200, "date_created": 1299985440.23648, "message": "I don't think the gzip method is valid since it uses a sliding window (32kB?). You would only get a meaningful comparison for the text immediately surrounding the junction point.", "group_id": 292, "id": 338699}, {"user_id": 8558, "stars": [], "topic_id": 11200, "date_created": 1299994129.56423, "message": "Perhaps not, but all methods have their limitations.  But if the sliding window is a problem, how about interleaving files x and y?  In, say, C that would leave all the \u201cinclude <someheader.h>\u201d at the top, in Python \u201cimport somemodule\u201d would appear at the top, and thus the resultant file would compress well if x and y are written in the same language.", "group_id": 292, "id": 339069}, {"user_id": 7479, "stars": [], "topic_id": 11200, "date_created": 1310881578.2107849, "message": "@Vasil want to team up and try it?", "group_id": 292, "id": 1655697}]