May 4, 2011

Problem with getting the full text of an article | CyberSEO Pro | CyberSEO support forum

Problem with getting the full text of an article
July 31, 2023
10:56 am
s.baryshev.aoasp
Member
Forum Posts: 68
Member Since: July 27, 2023

I am unable to get the full text of some articles.
For example, extraction succeeded here:

Login to see this link

But it failed here:

Login to see this link

I am using the "Use Full-Text RSS script" setting.

July 31, 2023
10:59 am
CyberSEO
Admin
Forum Posts: 4194
Member Since: July 2, 2009

The script does not guarantee that the full-text article will be extracted from each web page. Some pages have a complicated layout and it is not possible to parse them automatically. In this case it is recommended to use the container tag for article extraction as described in this article: https://www.cyberseo.net/blog/extracting-full-text-articles-using-container-tags/
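
To illustrate the container-tag idea mentioned above: the extractor is given a tag name and a set of attributes, finds the first matching element, and keeps only what is inside it. The following is a minimal Python sketch of that idea, not CyberSEO Pro's actual code (the plugin is written in PHP and its internals are not shown in this thread). The names ContainerExtractor and extract_container, and the sample page, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class ContainerExtractor(HTMLParser):
    """Collect the inner HTML of the first element matching a tag and attributes."""

    def __init__(self, tag, attrs):
        super().__init__(convert_charrefs=False)
        self.tag = tag
        self.wanted = attrs   # e.g. {"id": "article-body"} (hypothetical id)
        self.depth = 0        # nesting depth of `tag` inside the matched container
        self.parts = []
        self.done = False

    def _matches(self, attrs):
        found = dict(attrs)
        return all(found.get(key) == value for key, value in self.wanted.items())

    def _inside(self):
        return self.depth > 0 and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if tag == self.tag and self._matches(attrs):
                self.depth = 1  # entered the container; its own tag is not kept
            return
        if tag == self.tag:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self._inside():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._inside():
            return
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # the container's own closing tag was reached
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._inside():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._inside():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._inside():
            self.parts.append(f"&#{name};")


def extract_container(html, tag, attrs):
    """Return the container's inner HTML, or None if it is absent or never closed."""
    parser = ContainerExtractor(tag, attrs)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.done else None


page = (
    '<html><body><div id="nav">menu</div>'
    '<div id="article-body" class="content"><p>Hello <b>world</b></p>'
    '<div>inner</div></div><div id="footer">f</div></body></html>'
)
article = extract_container(page, "div", {"id": "article-body"})
```

In this sketch an attribute matches only if its value is exactly equal to the requested one, and a container whose closing tag is missing yields None. Both choices are assumptions of the sketch; the plugin may behave differently.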

July 31, 2023
11:06 am
s.baryshev.aoasp

The point is that both articles come from the same RSS feed:

Login to see this link

Also, there are errors in the browser console on this page:

Login to see the quote

July 31, 2023
11:08 am
CyberSEO
  1. The feed may be the same, but the pages can have different layouts. You can always check your feed here: Login to see this link
  2. The error mentioned in your post above is caused by the rsnportal WordPress theme and has nothing to do with CyberSEO Pro.
July 31, 2023
11:17 am
s.baryshev.aoasp

Okay, I will look into configuring the container tag.

July 31, 2023
1:04 pm
s.baryshev.aoasp

I'm trying to get the content of the page (Login to see this link) by specifying the attributes of the div container {"id": "block-system-main", "class": " block block-system"}, but I get this error:

[31-07-23 11:01:33] Processing a new post: Login to see this link
[31-07-23 11:01:33] Checking for duplicate by link
[31-07-23 11:01:33] Trying to extract full text article
[31-07-23 11:01:33] Tag specified: <div>
[31-07-23 11:01:33] Attributes specified: {"id": "block-system-main", "class": " block block-system"}
[31-07-23 11:01:34] Operation failed. Unable to retrieve full-text content from Login to see this link
[31-07-23 11:01:34] The post will not be added.

[31-07-23 11:01:34] 0 posts were added.

What am I doing wrong?

July 31, 2023
1:15 pm
CyberSEO

Container tag

Login to see the code

Attributes (JSON format):

Login to see the code

July 31, 2023
1:50 pm
s.baryshev.aoasp

Thank you, now I can extract the text of the article.
But the original formatting is preserved in the article.

I'm trying to remove the corresponding container by specifying div with the attributes {"class": "with-sidebar-first col-12 col-sm-12 col-md-12 col-lg-9 col-xl-9"} in the "Remove outer HTML elements" parameter, but after that the character encoding in the article breaks.

Login to see this link

The page code looks like this: Login to see this link

July 31, 2023
1:56 pm
CyberSEO

There is no such class as "with-sidebar-first col-12 col-sm-12 col-md-12 col-lg-9 col-xl-9". These are 6 different classes. Also, what do you mean by "formatting"? The classes themselves do not format the article in any way; the CSS rules attached to them do. If you don't import the CSS file, they have no effect.
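
To make the point about classes concrete: an HTML class attribute is a whitespace-separated list of class names, so matching should be done per name rather than against the whole string. The short Python sketch below shows this; the helper has_class is a hypothetical name used only for illustration and is not part of CyberSEO Pro.

```python
def has_class(class_attr, name):
    """True if `name` is one of the whitespace-separated names in a class attribute."""
    return name in class_attr.split()


value = "with-sidebar-first col-12 col-sm-12 col-md-12 col-lg-9 col-xl-9"
classes = value.split()  # six separate class names

first_is_present = has_class(value, "with-sidebar-first")
whole_string_is_a_class = has_class(value, value)
```

Here first_is_present is True, while whole_string_is_a_class is False, because no single class name contains spaces.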

If you want to remove part of the HTML code (for example, class="with-sidebar-first col-12 col-sm-12 col-md-12 col-lg-9 col-xl-9"), you can do it with custom PHP code, as described in this article: https://www.cyberseo.net/blog/content-customization-in-wordpress-with-cyberseo-pro/

Login to see the code

You can also strip any tags, such as <div>, <p>, <strong>, etc.: https://www.cyberseo.net/content-syndicator/#html-tags-to-strip

If you want to remove an entire container with all its contents, you should use this tool: https://www.cyberseo.net/content-syndicator/#remove-outer-html-elements

July 31, 2023
2:32 pm
s.baryshev.aoasp

My page content shifts to the right.
In addition, a second news heading is added.

Login to see this link

Original article: Login to see this link

How can I fix it?

July 31, 2023
2:38 pm
CyberSEO

Any part of the article, including the part described above as the second message, can be removed using the methods described in my previous post. I don't see any other problems. If the article's formatting looks wrong in your theme, I suggest modifying the theme's CSS styles or using an alternative theme. For example, your post looks absolutely correct in all standard WordPress themes. The plugin does not render the posts in the browser. Your theme does.

Also keep in mind that some web pages may have errors in their HTML structure (e.g. a missing </div>). Such posts may look fine in some HTML layouts and be displayed oddly in others. I suggest you try this option. If it doesn't help, just remove all <div> elements from the syndicated posts:

scr.gif

July 31, 2023
3:01 pm
s.baryshev.aoasp

OK, you wrote: "If you want to remove an entire container with all its contents, you should use this tool: https://www.cyberseo.net/content-syndicator/#remove-outer-html-elements"

I’m trying to remove the corresponding container by specifying div with the attributes {"class": "with-sidebar-first"} in the "Remove outer HTML elements" parameter. I want to remove this container.

As a result, the container was not removed and the character encoding in the article broke.

Login to see this link

Login to see this link

What am I doing wrong?

July 31, 2023
3:21 pm
CyberSEO

As I mentioned above, if the imported HTML page is broken and missing some closing element, you won't be able to do anything with it using standard tools. The only way to do it is to write a regular expression for your particular case, like this:

Login to see the code

Here is a good manual on regular expressions: Login to see this link
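
The regular expression from the reply above is hidden behind the login, so the following is only a hypothetical illustration of the general approach, written in Python rather than the PHP used by the plugin. It removes every opening and closing <div> tag while keeping the content between them, which does not depend on the tags being balanced. The names DIV_TAG and strip_div_tags are assumptions made for this sketch.

```python
import re

# Matches <div>, <div ...attributes...> and </div>, but not tags such as <divider>.
# Assumption: attribute values do not contain a literal ">" character.
DIV_TAG = re.compile(r"</?div\b[^>]*>", re.IGNORECASE)


def strip_div_tags(html):
    """Remove all <div> and </div> tags, leaving their contents in place."""
    return DIV_TAG.sub("", html)


# The outer <div> is never closed, as in a page with a missing </div>.
broken = '<div class="with-sidebar-first col-12"><p>Text</p><div><p>More</p></div>'
cleaned = strip_div_tags(broken)
```

Because the pattern never tries to pair an opening tag with its closing tag, a missing </div> does not affect the result. A pattern that has to remove a container together with its contents would need to be written for the specific page.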

July 31, 2023
3:32 pm
s.baryshev.aoasp

Okay, is it possible to just disable autoposting for a page whose code could not be retrieved?

July 31, 2023
3:35 pm
CyberSEO

Yes, if you are using the Full-Text RSS script and the article has not been retrieved, the post will not be added.

July 31, 2023
3:50 pm
s.baryshev.aoasp

But I see that the post is still being added, for example here: asp-news.ru/mercury/fgbu-vniizzh-continues-active-ra/

July 31, 2023
3:52 pm
CyberSEO

It is absolutely not informative to post links here. You should only post your Syndicator Log instead. Anything else is pointless.

In your case, the imported HTML articles are definitely broken. That's why they distort your page formatting and can't be processed by DOMXPath.

July 31, 2023
5:05 pm
s.baryshev.aoasp

I added the code $post['post_content'] .= preg_replace('//u', ' $1 ', $post['post_content']); to "Custom PHP code", but the page code does not change.

I tested this code on a test page, and it works.

Update: I realized my mistake.

July 31, 2023
5:17 pm
CyberSEO

Your code was broken when you edited the post, but if you want to replace something, you should use "=". I see ".=" above, which means a concatenation of two strings.
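
The difference between replacing and appending can be shown with a small sketch. It is written in Python, where "+=" on a string behaves like PHP's ".=" (append) and "=" is plain assignment, as in PHP. The clean function is a hypothetical stand-in for whatever transformation the lost regular expression performed.

```python
import re


def clean(content):
    """Hypothetical transformation: collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", content).strip()


post = {"post_content": "Hello   world"}

# Appending (like PHP ".="): the original text stays and the cleaned copy is added after it.
appended = dict(post)
appended["post_content"] += clean(appended["post_content"])

# Assigning (like PHP "="): the original text is replaced by the cleaned version.
replaced = dict(post)
replaced["post_content"] = clean(replaced["post_content"])
```

With the append form the content ends up containing both the untouched original and the processed copy, so the original markup is still present in the post.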

Forum Timezone: Europe/Amsterdam
