I am using OpenCV's LineSegmentDetector class in order to parse tables. However, I face an issue when I try to detect the lines inside the table. For the following image:
I use
import cv2

img = cv2.imread(TABLE_PATH)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# LSD_REFINE_ADV: advanced refinement; sigma_scale controls the Gaussian downscaling
lsd = cv2.createLineSegmentDetector(cv2.LSD_REFINE_ADV, sigma_scale=0.6)
dlines = lsd.detect(gray)
# dlines[0] has shape (N, 1, 4); each row is (x0, y0, x1, y1)
lines = [Line(x0, y0, x1, y1) for x0, y0, x1, y1 in dlines[0][:, 0]]
in order to detect line segments. However, the results are lousy. These are the lines it detects:
How can I make sure that words are not detected as lines? I cannot use hardcoded thresholds, since they would work for one example but not for another. Solutions in Python or Java would be appreciated.
Your detector did find lines, but the set contains some undesirable ones.
You could just filter the set of lines for line length. If you do that, you can easily exclude the very short lines coming from the text in that picture.
Implementation: use a list comprehension that keeps only the lines that are long enough. Write a predicate function that returns the length of one line, then use it in the comprehension.
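A minimal sketch of that filter, assuming each segment is an `(x0, y0, x1, y1)` tuple. `MIN_LEN` is a placeholder constant; to avoid per-image tuning you would derive it from the image size instead, e.g. some fraction of the image width:

```python
import math

def line_length(line):
    # Euclidean length of a segment given as (x0, y0, x1, y1)
    x0, y0, x1, y1 = line
    return math.hypot(x1 - x0, y1 - y0)

# hypothetical detections: two long table rules and one short blob from text
segments = [(0, 0, 200, 0), (0, 50, 200, 50), (10, 20, 14, 22)]

MIN_LEN = 30  # assumed threshold; better derived from image width than hardcoded
long_lines = [s for s in segments if line_length(s) >= MIN_LEN]
```

With the sample data above, only the two long rules survive the filter; the 4–5 px segment from the text is dropped.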
That is independent of how you extracted the lines from the picture. The LSD is one option, but there are also routines based on the Hough transform, which might fare better or worse than what you have.
You probably also noticed that your approach didn't find some lines that it should have. You might want to tweak the parameters you pass to your line detector. Or try another line detection approach.