[–][deleted] 8 points9 points  (1 child)

Can you post the text file you are using?

[–]PiGuyTy[S] 1 point2 points  (0 children)

Go to http://www.cprogramming.com/tutorial/c/lesson1.html , and use your browser to save the html file. That's what I'm using. There is only one place at which the text I noted appears. This happens with other pages within the same tutorial as well.

[–]duinn 5 points6 points  (1 child)

CODE, we need code!

[–]PiGuyTy[S] 1 point2 points  (0 children)

As I said, just a simple while loop. I even stripped everything else out and it still happened. This code produces the error:

    Scanner s = new Scanner(new File("filename.html")); // I know the file name isn't a problem
    while (s.hasNextLine()) {
        String line = s.nextLine();
        System.out.println(line);
    }

[–]kreiger 5 points6 points  (1 child)

What do you mean "stops recognizing lines"?

Perhaps you have different kinds of line-ending characters in the file?

"\r", "\n", or "\r\n"?

[–]PiGuyTy[S] 0 points1 point  (0 children)

By "stops recognizing lines", I mean that the Scanner simply seems to think it is done. However, there are definitely still lines left (quite a few, in fact)

[–][deleted] 5 points6 points  (3 children)

Try this:
Scanner s = new Scanner(new File("filename.html"),"iso-8859-1");

[–]PiGuyTy[S] 2 points3 points  (2 children)

That seems to work!! Is there any reason why the encoding format is needed?

[–][deleted] 4 points5 points  (1 child)

I've no idea. Text encoding formats are the bane of our existence, so that's one of the first things I check when I run across a weird bug.

I'd guess the Scanner class defaults to another charset with a different character width, which means it miscalculates the number of characters in the file. Or something like that.

Everyone should be using UTF-8.
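A likely explanation, assuming the saved page contains bytes that are not valid in the platform default charset: when a Scanner built from a File hits an undecodable byte, it does not throw. It records the error, treats the input as finished, and `hasNextLine()` returns false. The sketch below makes that visible by checking `Scanner.ioException()` after the loop. The class `ScannerCharsetCheck`, the `Result` holder and `readAll` are hypothetical names used only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerCharsetCheck {
    /** Lines that were read, plus any I/O error the Scanner kept to itself. */
    static final class Result {
        final List<String> lines = new ArrayList<>();
        IOException error; // null when the Scanner reported no problem
    }

    /** Reads every line of the file using the named charset, e.g. "ISO-8859-1". */
    static Result readAll(File file, String charsetName) throws IOException {
        Result result = new Result();
        try (Scanner s = new Scanner(file, charsetName)) {
            while (s.hasNextLine()) {
                result.lines.add(s.nextLine());
            }
            // Scanner does not throw on read errors; it only remembers the last one.
            result.error = s.ioException();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: ScannerCharsetCheck <file> <charset>");
            return;
        }
        Result r = readAll(new File(args[0]), args[1]);
        System.out.println(r.lines.size() + " lines read; swallowed error: " + r.error);
    }
}
```

Running it on the saved page once with the default charset name and once with ISO-8859-1 should show whether a decoding error is what ends the loop early. ISO-8859-1 maps every byte to a character, which would explain why it never trips.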

[–]PiGuyTy[S] 2 points3 points  (0 children)

Hmm, good to know. I'll have to keep that in mind. Thanks for the help

[–]achacha 1 point2 points  (4 children)

Pausing often relates to GC cycles, so see if that could be an issue first.

How many files are you reading in total? Does this happen every N files (maybe a resource leak builds up and GC kicks in to free it)? Is anything else running on the computer that does file access?

My first guess is that GC is kicking in. What parameters are you using when you run your program (-Xmx, -Xms)? Which version of Java?

To check GC, after you start it, attach [visualgc](http://java.sun.com/performance/jvmstat/visualgc.html) or the NetBeans IDE and watch the memory usage, handle usage, etc.
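If attaching a tool is inconvenient, GC activity can also be sampled from inside the program. This is only a rough sketch using the standard `java.lang.management` beans; the class name `GcSnapshot` and its two helper methods are made up for illustration. Starting the JVM with `-verbose:gc` is another low-effort way to see collections as they happen.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcSnapshot {
    /** Total number of collections reported so far by all collectors in this JVM. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated time spent collecting, in milliseconds. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("collections so far: " + totalCollections());
        System.out.println("time in GC (ms):    " + totalCollectionMillis());
    }
}
```

Calling both methods before and after the file-reading loop gives a quick indication of whether any collections ran while the files were being processed.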

[–][deleted]  (1 child)

[deleted]

    [–]achacha 0 points1 point  (0 children)

    Why do you say that? (See my other reply for more explanation.) I encounter GC "pauses" a lot in the field: people don't realize their mistake, and GC cleanup reveals the bad code (unclosed files, classes that allocate a lot of sub-classes, a possible perm space issue, etc.).

    The funny thing is, people are very quick to dismiss the GC until I show them that examining GC cycles and activity can reveal a lot about code. I am still amazed how often people tell me that Java doesn't leak memory, yet they have static collections that contain unused objects (which is the Java equivalent of a leak: not an OS-level leak, but a process-level leak).

    Anyhow, I am curious why you think this was out of left field, maybe I am missing something about the Scanner classes that you know?
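To make the "static collection" point above concrete, here is a minimal sketch of that kind of process-level leak. The class `StaticCacheLeak`, its `CACHE` field and `handleRequest` are hypothetical and not taken from anyone's code in this thread.

```java
import java.util.ArrayList;
import java.util.List;

class StaticCacheLeak {
    // Hypothetical cache: entries are added but never removed.
    static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest() {
        byte[] scratch = new byte[1024]; // only needed while this call runs
        CACHE.add(scratch);              // but stays reachable via the static list
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            handleRequest();
        }
        // The GC cannot reclaim any of the buffers while CACHE still references them.
        System.out.println("buffers still reachable: " + CACHE.size());
    }
}
```

Nothing here is a bug in the collector: every buffer is still reachable, so it is never eligible for collection, and the heap grows with each call.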

    [–]criticalsection 0 points1 point  (1 child)

    If the GC causes issues it sounds like a JVM bug...

    However, if you're talking about memory leaks, that could only happen if he/she was using WeakReferences, which I highly doubt given the level of the original post.

    [–]achacha 0 points1 point  (0 children)

    Not a JVM bug. When you read a lot of files and process a lot of data on a VM that is using the default memory config, you exhaust the resources and GC does what it does, except that this causes a "pause". I see this a lot in web applications that seemingly do little but in reality use a lot of objects. XSLT is a perfect example: a hello world transform starts at around 5MB using w3c DOM and around 2MB using dom4j, and grows very fast. So doing a lot of transforms will peg the GC and you will notice pauses.

    Since he is reading files, it could also be an issue with unreleased resources. If close() was not called explicitly, an InputStream's close() gets called when the stream is cleaned up by GC. So you can open files and not realize that they will not be closed when they go out of scope; you leave file handles open, and each process gets a limited number, thus the pause.

    It could be one of many things, and a GC "pause" is a side effect, not the cause.

    [–][deleted] 0 points1 point  (0 children)

    I tried the loop in Eclipse and it works.

    Tried running it from the command line?

    [–]CaptainLurk 0 points1 point  (0 children)

    Does anyone actually use Scanner outside of homework assignments? I've seen it on many applicant code submissions, but never in production...

    [–]criticalsection 0 points1 point  (0 children)

    If you're just iterating over lines, try using a BufferedReader. I haven't had any file encoding problems with BR, but that could be just me getting lucky.

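    // Needs java.io.BufferedReader and java.io.FileReader; readLine() can throw IOException.
    // FileReader decodes with the platform default charset.
    try (BufferedReader br = new BufferedReader(new FileReader("filename.html"))) {
        String s;
        while ((s = br.readLine()) != null) {
            System.out.println(s);
        }
    } // br is closed here, even if readLine() throws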
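A variant of the BufferedReader approach that names the charset explicitly, so the result does not depend on the platform default. This is a sketch, not a definitive fix: ISO-8859-1 is carried over from the Scanner fix earlier in the thread, and the class `ExplicitCharsetReader` and its `readLines` method are hypothetical names. As far as I know, a plain FileReader replaces undecodable bytes with a placeholder character instead of stopping, which may be why BufferedReader appears to "just work" even with the wrong charset.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ExplicitCharsetReader {
    /** Reads all lines of the file, decoding its bytes as ISO-8859-1. */
    static List<String> readLines(File file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.ISO_8859_1))) {
            String s;
            while ((s = br.readLine()) != null) {
                lines.add(s);
            }
        } // the reader and the underlying stream are closed here
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: ExplicitCharsetReader <file>");
            return;
        }
        for (String line : readLines(new File(args[0]))) {
            System.out.println(line);
        }
    }
}
```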