1920 Commits

Author SHA1 Message Date
esaunders
ff85d494ed Tightened up IP address regex to not match hits that have more than 4 elements separated by dots. Removed trailing slash from URL regex as it is not needed...the trailing slash is being handled as one of the boundary characters in RegexQuery 2017-01-09 16:39:53 -05:00
Eugene Livis
09014d34b6 Registering as service providers 2017-01-09 15:58:50 -05:00
Eugene Livis
e3ed9dfc34 Resolved merge conflicts 2017-01-09 14:08:14 -05:00
Eugene Livis
7e291b1124 Merge latest 2017-01-09 10:37:35 -05:00
Eugene Livis
75441946a1 Minor 2017-01-09 10:23:03 -05:00
millmanorama
860f4361b3 Merge branch '2132-32k-chunks' into 2184-overlapping-chunks
# Conflicts:
#	KeywordSearch/src/org/sleuthkit/autopsy/keywordsearch/Ingester.java
2017-01-09 11:51:36 +01:00
millmanorama
a8cfbd1e10 refactor TextExtractor to an interface and to remove the intermediary getInputStream() method 2017-01-09 11:07:12 +01:00
millmanorama
7325174dc3 Merge remote-tracking branch 'upstream/develop' into 2132-32k-chunks 2017-01-09 10:47:59 +01:00
Richard Cordovano
be7bdced90 Merge in develop branch with text extraction refactoring 2017-01-08 10:48:17 -05:00
Richard Cordovano
837eb1477f Pull in text extraction refactoring and resolve merge conflicts 2017-01-08 10:22:18 -05:00
Richard Cordovano
b0ce3168df Merge pull request #2434 from millmanorama/fix-compiler-warnings
fix compiler warnings about raw types
2017-01-07 11:10:24 -05:00
Richard Cordovano
8fbb19a67d Merge remote-tracking branch 'upstream/develop' into search_improvements 2017-01-07 10:32:34 -05:00
Richard Cordovano
5463d3a719 Remove kws public Server.getIngester, Ingester is not public 2017-01-07 10:30:58 -05:00
Eugene Livis
4299a2326e More work 2017-01-06 16:22:49 -05:00
Eugene Livis
b5e3639167 Fixing comments 2017-01-06 16:18:45 -05:00
Eugene Livis
d23b78f57c Fixing comments 2017-01-06 16:17:42 -05:00
Eugene Livis
bb0c3e55eb Fixing comments 2017-01-06 16:14:53 -05:00
Eugene Livis
21f2efbdcf More work 2017-01-06 16:05:11 -05:00
Eugene Livis
b05dded08a Got index folder search algorithm to work 2017-01-06 15:48:44 -05:00
millmanorama
161ba2098c cleanup and comments in Chunker 2017-01-06 14:53:58 +01:00
millmanorama
990433fc36 refactor Chunker read methods to use a common helper method. 2017-01-06 13:16:40 +01:00
millmanorama
64ba5f6e66 Merge remote-tracking branch 'upstream/develop' into 2184-overlapping-chunks 2017-01-06 11:09:27 +01:00
millmanorama
52251bcb2e move Reader reset back to beginning of next() and increase buffer size to 2048. 2017-01-06 00:03:45 +01:00
Eugene Livis
40cc726a11 First cut at integrating AutopsyServiceProvider 2017-01-05 17:16:27 -05:00
Eugene Livis
7d252864a4 Index folder finding algorithm seems to work 2017-01-05 13:41:12 -05:00
Eugene Livis
4555f7d44d Merge branch 'search_improvements' of https://github.com/sleuthkit/autopsy into solr65 2017-01-05 12:42:55 -05:00
Richard Cordovano
210068e241 Merge in develop branch 2017-01-05 10:24:59 -05:00
esaunders
acae764760 Modified the phone number regex to pick up numbers that have spaces in them...perhaps this will produce more false positives but in our test data it produces over 1,000 extra numbers that are not found in Autopsy 4.2. Also updated the email regex to find email addresses surrounded in {} sometimes seen in academic publications. 2017-01-04 17:14:06 -05:00
esaunders
ba7f8ab9b3 Consolidated boundary characters into a single list. 2017-01-04 15:19:21 -05:00
esaunders
c172e0f16e Fix for missing characters in snippets and reduce length of snippets in an attempt to more closely match previous version of Autopsy. 2017-01-04 12:23:42 -05:00
esaunders
8432fec205 Updated email and url regexes to be case insensitive. 2017-01-04 12:22:17 -05:00
millmanorama
5e0f9abdf9 reset at end to avoid "This stream has not been marked" error. 2017-01-04 17:22:56 +01:00
millmanorama
151742c21b record length in chars and mark/reset reader to produce overlaps 2017-01-04 17:16:20 +01:00
millmanorama
d8ec4290f2 reduce max window size to prevent off by one error 2017-01-04 17:16:19 +01:00
millmanorama
94e136b451 first pass at overlapping chunks 2017-01-04 17:16:17 +01:00
millmanorama
d14c15fbdb bump chunk size to exactly 32k, single read chars to 1024 2017-01-04 12:25:34 +01:00
Eugene Livis
62ad3e1eb2 First cut of index search algorithm 2017-01-03 16:51:57 -05:00
esaunders
6304300f62 Merge branch 'develop' of github.com:sleuthkit/autopsy into 2121_regex_query 2017-01-03 12:56:04 -05:00
esaunders
45c2b0c065 Set results max page size to 512. 2017-01-03 12:48:39 -05:00
esaunders
c1f326775a Added result paging support. 2017-01-03 12:47:16 -05:00
millmanorama
8410970b11 Chunker implements Iterator and Iterable 2017-01-03 14:57:55 +01:00
millmanorama
15c2d395fa move Chunk and Chunker out of Ingester 2017-01-03 14:26:48 +01:00
millmanorama
d2a6fe3fda move chunking algorithm into separate class(es) and reduce chunk size to ~32k 2017-01-03 14:26:46 +01:00
Richard Cordovano
46369eff44 Update NBM versioning for 4.3.0 2017-01-02 18:45:21 -05:00
Richard Cordovano
13411450aa 4.3.0 preps: DSPs, public API restore, const name 2017-01-02 17:36:59 -05:00
millmanorama
3557f141e1 use UTF-8 encoding for ArtifactTextExtractor streams and readers 2017-01-02 16:45:51 +01:00
millmanorama
4ae0a688bc don't commit unnecessarily 2016-12-31 14:31:11 +01:00
esaunders
681699467d Needed to tweak the CC regex and our boundary characters to successfully match CC numbers in our test data set. 2016-12-28 14:37:51 -05:00
millmanorama
8526427b4f cleanup and comment TextExtractor
cleanup and comment TextExtractor implementations more.

remove constants left over from merge
2016-12-28 17:30:42 +01:00
millmanorama
f56c2b43c8 move all 'appendix' related code into TikaTextExtractor and simplify TextExtractor interface. 2016-12-28 17:30:32 +01:00